Stream Computing on Graphics Hardware

 BY:Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan
Computer Science Department
Stanford University

To appear at SIGGRAPH 2004

Abstract

In this paper, we present Brook for GPUs, a system for general-purpose computation on programmable graphics hardware. Brook extends C to include simple data-parallel constructs, enabling the use of the GPU as a streaming coprocessor. We present a compiler and runtime system that abstracts and virtualizes many aspects of graphics hardware. In addition, we present an analysis of the effectiveness of the GPU as a compute engine compared to the CPU, to determine when the GPU can outperform the CPU for a particular algorithm. We evaluate our system with five applications, the SAXPY and SGEMV BLAS operators, image segmentation, FFT, and ray tracing. For these applications, we demonstrate that our Brook implementations perform comparably to hand-written GPU code and up to seven times faster than their CPU counterparts.

Paper http://graphics.stanford.edu/papers/brookgpu/brookgpu.pdf

Presentation http://graphics.stanford.edu/papers/brookgpu/buck.Brook.pdf

 

 

Loading mentions Retweet
Filed under  //  HIVE   research   stream computing  
Comments (0)
Posted 3 months ago

Stream Computing FAQ

What is stream computing?
Stream computing (or stream processing) refers to a class of compute problems, applications or tasks that can be broken down into parallel, identical operations and run simultaneously on a single processor device. These parallel data streams entering the processor device, computations taking place and the output from the device define stream computing.

Today, stream computing is primarily the realm of the graphics processor unit (GPU) where the parallel processes used to produce graphics imagery are used instead to perform arithmetic calculations.

Characteristics of stream computing:

  • Enable new applications on new architectures
  • Parallel problems other than graphics that map well on GPU architecture
  • Transition from fixed function to programmable pipelines
  • Various proof points in research and industry under the name GPGPU

How does stream computing differ from computation on the CPU?
Stream computing takes advantage of a SIMD methodology (single instruction, multiple data) whereas a CPU is a modified SISD methodology (single instruction, single data); modifications taking various parallelism techniques into account.

The benefit of stream computing stems from the highly parallel architecture of the GPU whereby tens to hundreds of parallel operations are performed with each clock cycle whereas the CPU can at best work only a small handful of parallel operations per clock cycle.

What are AMD's stream computing product features?
AMD's FireStream™ 9170, our latest generation stream computing GPU, features:

  • 320 stream cores (compute units or ALUs)
  • 2GB on-board GDDR3 memory
  • Double precision floating point support
  • PCIe 2.0 x16 interface

View AMD FireStream 9170 specifications.

What are AMD's stream computing product advantages?
AMD's FireStream 9170 hardware:

  • Only company positioned to offer a unique platform with strengths in accelerated GPU as well as CPU computing
  • Stream computing today leading to fusion tomorrow

AMD's open systems SDK approach:

  • CTM initiative — Release low level specifications to enable developers and end users to understand the architecture and tuning to maximize performance
  • Deliver high level, multi-targeted compilers through Brook, 3rd parties like RapidMind, and partnerships with universities and industry.
  • Deliver library functions through AMD's ACML, APL, Cobra, and through university partner program.

View AMD FireStream 9170 specifications.

When can I get an AMD stream computing product and what does it cost?
The FireStream 9170, AMD's flagship stream computing platform, is scheduled to be available in Q1 2008 in quantity. Please contact us for a price quote.

software development kit containing compilers, libraries, performance profiling tools and drivers is available for download. SDK version 1.0 will be available in Q1 2008.

This SDK is a compilation of open source software and proprietary AMD software put into the open source.

Included in the first release are compilers, performance profilers, AMD's core math library (ACML) and AMD's compute abstraction layer (CAL) which enables device programming in familiar high-level languages rather than graphics programming specific to the GPU.

Please read our stream computing whitepaper (PDF 1.1MB) for more information about this SDK.

How does AMD's stream computing address the IEEE754 standard for double precision floating point computation?
The IEEE754 standard defines formats for representing single and double-precision floating point numbers as well as some special cases like denorms, infinities and NaNs. It also defines four rounding modes and methods for exception handling.

When we were preparing to launch our stream computing initiative in 2006, a series of customer interviews was conducted to get input on requirements relative to this standard. They learned that as long as we handled the special cases according to the most common usage, complete IEEE754 compliance wasn't required. AMD's FireStream 9170 implementation should handle a large majority of customers' requirements.

In the AMD FireStream 9170:

  • Infinities and NaNs are handled as defined by the IEEE754 standard.
  • Rounding is handled using the "round to nearest" mode, which is the mode generally used in most applications.
  • Denormal numbers are flushed to zero. This is a common optimization in implementations where full-speed hardware support is not available, and is adequate for most applications.

What does AMD's software stack look like?
AMD has authored a whitepaper (PDF) that discusses our software stack.

How does the AMD FireStream support Linux?
AMD is committed to Linux and sees Linux as a major platform for Stream Computing. Recently we have announced an initiative on open source driver for Linux. We are continuing our momentum and we expect that stream computing stack will support Linux over the next calendar year.

What type of programming model does AMD use for AMD FireStream?
AMD encourages a pure streaming/SIMD model for AMD FireStream. A few enhancements like data sharing are useful for a small subset of applications. However data sharing in a SIMD environment brings its own challenges and should to be used with utmost care. In fact if used incorrectly performance might actually degrade.

Regarding specific compiler implementation choices — currently we have enabled Brook with a CAL backend. We are looking at other options as well, including industry standards.

What happened to AMD's CTM?
CAL is a natural evolution to CTM — we are building our software stack bottoms up. We provide low-level access and specs as CTM extension to CAL. CAL permits the end user to write portable code. Frequently our developers like to drop down to the next level of detail for further tuning and profiling. CAL provides this level of access.

Will the AMD FireStream SDK work on previous generation hardware?
To run the CAL/Brook+ SDK, you need a platform based on the AMD R600 GPU or later. R600 and newer GPUs are found with ATI Radeon™ HD2400, HD2600, HD2900 and HD3800 graphics board.

Which applications are best suited to Stream Computing?
Applications best suited to stream computing possess two fundamental characteristics:

  1. A high degree of arithmetic computation per system memory fetch
  2. Computational independence — arithmetic occurs on each processing unit without needing to be checked or verified by or with arithmetic occurring on any other processing unit.

Examples include:

  • Engineering — fluid dynamics
  • Mathematics — linear equations, matrix calculations
  • Simulations — Monte Carlo, molecular modeling, etc.
  • Financial — options pricing
  • Biological — protein structure calculations
  • Imaging — medical image processing

If Stream processors are really GPUs, will I need to learn graphics programming to properly implement my application?
No. AMD along with the open source community are working to mask the GPU's graphics programming heritage. This is being accomplished by our release of Brook+, the open source Brook compiler plus AMD enhancements geared directly at non-graphics stream computing, and AMD's CAL — Compute Abstraction Layer. CAL provides high-level language access to the various parts of the GPU as needed.

Developers are thus able to write directly to the GPU without needing to learn graphics-specific programming languages. CAL provides direct communication to the device.

Will future stream computing architectures force me to rewrite my applications?
Implementing a new algorithm or application in a stream computing environment will require the use of various stream-specific techniques. These techniques and tools are all available in the AMD FireStream SDK described above.

Existing applications that currently use only the CPU for computation will require recompiling to take advantage of the capabilities of the stream processor.

We anticipate most applications running in a stream computing environment in the near term will be applications written from scratch with the intent of implementing a stream computing platform.

Over time as applications undergo typical rewrites and recompiles, those applications naturally suited to the stream computing environment will migrate to this environment along with the necessary recoding and recompiling tasks.

Is stream computing a return to the old co-processor days?
In many ways stream computing does resemble the days when vector co-processors handled substantial mathematical tasks. The benefit then as now is the remarkable performance boost gained through implementing these specialized components.

We fully anticipate technological advancements as well as programming techniques to pull these co-processors closer to the CPU over time until, as with earlier co-processors, they disappear into the CPU itself.

AMD's competitors offer similar but non-standardized products. Should I wait on product standardization before exploring Stream Computing?
AMD is focused on providing the tools necessary to help our customers succeed with our AMD FireStream products, and we believe the open systems approach is a critical component of this philosophy. Open systems enables AMD along with partners and 3rd party vendors to collaborate closely when developing highly integrated solutions as well as work independently when targeting a niche solution.

AMD's open systems philosophy includes:

  • Open IL and ISA specifications to ensure developers can optimize system performance
  • Support for AMD Brook+ along with other 3rd party high-level tools to provide a choice of familiar development environments
  • Open source Linux drivers and AMD-enhanced Brook+ enabling developers to modify and retarget tools as needed
  • AMD partnership opportunities with system vendors and integrators to deliver customer-focused solutions

Who can I contact at AMD for more information?
Contact us with general questions about AMD's stream computing initiative, products, sales and training.

Contact us with technical questions about FireStream hardware, software or developer issues.

AMD Compute Abstraction Layer (CAL) FAQ

How do I get started with AMD CAL SDK?
We have assembled a number of documents to help guide you through the setup and early use of CAL. Please read these before getting started:

We have also authored a programmer's guide which is included in the SDK download.

Note that the three CAL files are also included in the SDK. We have posted them here as well for your reference prior to installing CAL.

Why does the integrated installer (setup.exe) behave badly?
The integrated installer is designed to invoke the CAL and BROOK+ installers in that order. Most of the installation logic is present in each individual MSI.

If you have problems with this installer, please remove any previous versions of CAL/BROOK+ using Add/Remove programs in Control Panel and try again.

Does the Repair/Modify option update my previous installation of CAL/BROOK+?
No, you would need to completely remove the previous version of CAL/BROOK+. The Repair/Modify option only repairs the current version of CAL/BROOK+.

What is CALROOT and where is it set and used?
CALROOT is set as the path to the CAL SDK during installation. It is defined in the current user's Environment Variables. Other users on the system have to define CALROOT in their environment or in the system environment.

The Visual Studio project files for the samples use CALROOT to locate the CAL headers and libraries. Some sample projects also use CALROOT to load themselves. You would not need CALROOT unless you wish to build the samples.

Brook+ FAQ

How do I get started with Brook+?
Brook+ requires the following be installed or available to work with AMD FireStream technology:

  • Visual Studio (for Windows developers) or GCC (for Linux developers) installed and all environment variables correctly set up
  • Cygwin (for Windows) — must appear later in the PATH variable than the Visual Studio tools
  • CAL SDK installed from the same source as you obtained the source tree from
  • CALROOT environment variable set

To build the compiler and runtime, enter the "platform" directory and type "make". This generates and fills the "sdk" directory.

To build the samples, first build the sdk as above, then enter the "samples" directory and type "make".

Where can I get CAL support?
Information was shipped in the installer.

What graphics driver does Brook+ require?
Brook+ has no direct dependency on the graphics driver. CAL working correctly with a given driver should result in Brook+ working correctly.

How is BROOKROOT used?
The makefiles need CALROOT defining. BROOKROOT is not used by the makefile system.

 

Loading mentions Retweet
Filed under  //  HIVE   IBM   new   pgm   stream computing   technology  
Comments (0)
Posted 4 months ago

HIVE

MY PREVIOUS ARTICLE ABOUT HIVE:

http://swathidharshananaidu.posterous.com/open-source-hive-large-scale-distributed-data

GET more from http://wiki.apache.org/hadoop/Hive

What is Hive

 Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, adhoc querying and analysis of large datasets data stored in Hadoop files. It provides a mechanism to put structure on this data and it also provides a simple query language called QL which is based on SQL and which enables users familiar with SQL to query this data. At the same time, this language also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis which may not be supported by the built in capabilities of the language.

Hive does not mandate read or written data be in "hive format" - there is no such thing; Hive works equally well on Thrift, control delimited, or your data format. Please see File Format and SerDe in Developer Guide for details.

What Hive is NOT

Hive is based on Hadoop which is a batch processing system. Accordingly, this system does not and cannot promise low latencies on queries. The paradigm here is strictly of submitting jobs and being notified when the jobs are completed as opposed to real time queries. As a result it should not be compared with systems like Oracle where analysis is done on a significantly smaller amount of data but the analysis proceeds much more iteratively with the response times between iterations being less than a few minutes. For Hive queries response times for even the smallest jobs can be of the order of 5-10 minutes and for larger jobs this may even run into hours.

If your input data is small you can execute a query in a short time. For example, if a table has 100 rows you can 'set mapred.reduce.tasks=1' and 'set mapred.map.tasks=1' and the query time will be ~15 seconds.

 

Table of Contents

  1. Hive introduction videos From Cloudera
  2. Preparations
    1. Requirements
    2. Downloading and building
    3. Running Hive
    4. Configuration management overview
    5. Error Logs
  3. DDL Operations
    1. Metadata Store
  4. DML Operations
  5. SQL Operations
    1. Runtime configuration
    2. Example Queries
      1. SELECTS and FILTERS
      2. GROUP BY
      3. JOIN
      4. MULTITABLE INSERT
      5. STREAMING

 

DISCLAIMER: This is a prototype version of Hive and is NOT production quality. However, we are working hard to make Hive a production quality system. Hive has only been tested on unix(linux) and mac systems using Java 1.6 for now - although it may very well work on other similar platforms. It does not work on Cygwin right now. Most of our testing has been on Hadoop 0.17.2 - so we would advise running it against this version of hadoop - even though it may compile/work against other versions

Hive introduction videos From Cloudera

www.png" height="11" alt="[WWW]" style="" width="11" /> Hive Introduction Video

www.png" height="11" alt="[WWW]" style="" width="11" /> Hive Tutorial Video

Preparations

Requirements

  • Java 1.6

  • Hadoop 0.17.x to 0.19.x. Support of Hadoop 0.20.x is in progress: [www.png" height="11" alt="[WWW]" style="" width="11" /> HIVE-487]

Downloading and building

Hive is available via SVN at: www.png" height="11" alt="[WWW]" style="" width="11" /> http://svn.apache.org/repos/asf/hadoop/hive/trunk

  $ svn co http://svn.apache.org/repos/asf/hadoop/hive/trunk hive
  $ cd hive
  $ ant -Dhadoop.version="<your-hadoop-version>" package
  # For example
  $ ant -Dhadoop.version="0.17.2" package
  $ cd build/dist
  $ ls
  README.txt
  bin/ (all the shell scripts)
  lib/ (required jar files)
  conf/ (configuration files)
  examples/ (sample input and query files)

In the rest of the page, we use build/dist and <install-dir> interchangeably.

Running Hive

Hive uses hadoop that means:

  • you must have hadoop in your path OR

  • export HADOOP_HOME=<hadoop-install-dir>

In addition, you must create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and set them chmod g+w in HDFS before a table can be created in Hive.

Commands to perform this setup

  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /tmp
  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /user/hive/warehouse
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse

I also find it useful but not necessary to set HIVE_HOME

  $ export HIVE_HOME=<hive-install-dir>

To use hive command line interface (cli) from the shell:

  $ $HIVE_HOME/bin/hive

Configuration management overview

- hive default configuration is stored in <install-dir>/conf/hive-default.xml

  • Configuration variables can be changed by (re-)defining them in <install-dir>/conf/hive-site.xml

- log4j configuration is stored in <install-dir>/conf/hive-log4j.properties

- hive configuration is an overlay on top of hadoop - meaning the hadoop configuration variables are inherited by default.

- hive configuration can be manipulated by:

  • editing hive-site.xml and defining any desired variables (including hadoop variables) in it

  • from the cli using the set command (see below)

  • by invoking hive using the syntax:

    • $ bin/hive -hiveconf x1=y1 -hiveconf x2=y2

      • this sets the variables x1 and x2 to y1 and y2 respectively

Error Logs

Hive uses log4j for logging. By default logs are not emitted to the console by the cli. They are stored in the file: - /tmp/{user.name}/hive.log

If the user wishes - the logs can be emitted to the console by adding the arguments shown below: - bin/hive -hiveconf hive.root.logger=INFO,console

Note that setting hive.root.logger via the 'set' command does not change logging properties since they are determined at initialization time.

Error logs are very useful to debug problems. Please send them with any bugs (of which there are many!) to [MAILTO] hive-dev@hadoop.apache.org.

DDL Operations

Creating Hive tables and browsing through them

  hive> CREATE TABLE pokes (foo INT, bar STRING);  

Creates a table called pokes with two columns, the first being an integer and the other a string

  hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);  

Creates a table called invites with two columns and a partition column called ds. The partition column is a virtual column. It is not part of the data itself but is derived from the partition that a particular dataset is loaded into.

By default, tables are assumed to be of text input format and the delimiters are assumed to be ^A(ctrl-a).

  hive> SHOW TABLES;

lists all the tables

  hive> SHOW TABLES '.*s';

lists all the table that end with 's'. The pattern matching follows Java regular expressions. Check out this link for documentation [WWW]http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html

hive> DESCRIBE invites;

shows the list of columns

As for altering tables, table names can be changed and additional columns can be dropped:

  hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
  hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
  hive> ALTER TABLE events RENAME TO 3koobecaf;

Dropping tables:

  hive> DROP TABLE pokes;

Metadata Store

Metadata is in an embedded Derby database whose disk storage location is determined by the hive configuration variable named javax.jdo.option.ConnectionURL. By default (see conf/hive-default.xml), this location is ./metastore_db

Right now, in the default configuration, this metadata can only be seen by one user at a time.

Metastore can be stored in any database that is supported by JPOX. The location and the type of the RDBMS can be controlled by the two variables 'javax.jdo.option.ConnectionURL' and 'javax.jdo.option.ConnectionDriverName'. Refer to JDO (or JPOX) documentation for more details on supported databases. The database schema is defined in JDO metadata annotations file package.jdo at src/contrib/hive/metastore/src/model.

In the future, the metastore itself can be a standalone server.

If you want to run the metastore as a network server so it can be accessed from multiple nodes try HiveDerbyServerMode.

DML Operations

Loading data from flat files into Hive:

  hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes; 

Loads a file that contains two columns separated by ctrl-a into pokes table. 'local' signifies that the input file is on the local file system. If 'local' is omitted then it looks for the file in HDFS.

The keyword 'overwrite' signifies that existing data in the table is deleted. If the 'overwrite' keyword is omitted, data files are appended to existing data sets.

NOTES:

  • NO verification of data against the schema is performed by the load command.

  • If the file is in hdfs, it is moved into the Hive-controlled file system namespace. The root of the Hive directory is specified by the option 'hive.metastore.warehouse.dir' in hive-default.xml. We advise users to create this directory before trying to create tables via Hive.

  hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
  hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');

The two LOAD statements above load data into two different partitions of the table invites. Table invites must be created as partitioned by the key ds for this to succeed.

  hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

The above command will load data from an HDFS file/directory to the table. Note that loading data from HDFS will result in moving the file/directory. As a result, the operation is almost instantaneous.

SQL Operations

Runtime configuration

  • Hive queries are executed using map-reduce queries and, therefore, the behavior of such queries can be controlled by the hadoop configuration variables.

  • The cli command 'SET' can be used to set any hadoop (or hive) configuration variable. For example:

    hive> SET mapred.job.tracker=myhost.mycompany.com:50030
    hive> SET -v 
  • The latter shows all the current settings. Without the -v option only the variables that differ from the base hadoop configuration are displayed

  • In particular, the number of reducers should be set to a reasonable number to get good performance (the default is 1!)

Example Queries

Some example queries are shown below. They are available in build/dist/examples/queries. More are available in the hive sources at ql/src/test/queries/positive

SELECTS and FILTERS
  hive> SELECT a.foo FROM invites a WHERE a.ds='<DATE>';

selects column 'foo' from all rows of partition <DATE> of invites table. The results are not stored anywhere, but are displayed on the console.

Note that in all the examples that follow, INSERT (into a hive table, local directory or HDFS directory) is optional.

  hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='<DATE>';

selects all rows from partition <DATE> OF invites table into an HDFS directory. The result data is in files (depending on the number of mappers) in that directory. NOTE: partition columns if any are selected by the use of *. They can also be specified in the projection clauses.

Partitioned tables must always have a partition selected in the WHERE clause of the statement.

  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;

Selects all rows from pokes table into a local directory

  hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
  hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100; 
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' select a.invites, a.pokes FROM profiles a;
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a WHERE a.ds='<DATE>';
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;

Sum of a column. avg, min, max can also be used

GROUP BY
  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;
  hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
JOIN
  hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
MULTITABLE INSERT
  FROM src
  INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
  INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
  INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
  INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
STREAMING
  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';

This streams the data in the map phase through the script /bin/cat (like hadoop streaming). Similarly - streaming can be used on the reduce side (please see the Hive Tutorial or examples)

last edited 2009-07-24 08:28:25 by ZhengShao

 

Loading mentions Retweet
Filed under  //  cloud computing   HIVE   sql  
Comments (0)
Posted 4 months ago

Open source Hive: Large-scale, distributed data processing made easy

Thank heaven for Hive, a data analysis and query front end for Hadoop that makes Hadoop data files look like SQL tables

Suppose you want to run regular statistical analyses on your Web site's traffic log data -- several hundred terabytes, updated weekly. (Don't laugh. This is not unheard of for popular Web sites.) You're already familiar with Hadoop (see InfoWorld's review), the open source distributed processing system that would be ideal for this task. But you don't have time to code Hadoop map/reduce functions? Perhaps you're not the elite programmer that everyone in the office thinks you are.

What you'd like to do is dump all that information into a database, and execute a set of SQL queries on it. But the quantity of data would overwhelm even an enterprise-level RDBMS.

[ Read the InfoWorld Test Center's hands-on account of working with Amazon Elastic MapReduce and Amazon Web Services. | Keep abreast of cloud computing news by visiting InfoWorld's Cloud Computing channel. ]

This is precisely the problem that engineers at Facebook encountered. They became interested in Hadoop as a means of processing their Web site's traffic data that was generating terabytes per day, was growing, and was overtaxing their Oracle database. Though they were happy with Hadoop, they wanted to simplify its use so that engineers could express frequently used analysis operations in SQL. The resulting Hadoop-based data warehouse application became Hive, and it helps to process more than 10TB of Facebook data daily. Now Hive is available as an open source subproject of Apache Hadoop.

Inside the Hive
Written in Java, Hive is a specialized execution front end for Hadoop. Hive lets you write data queries in an SQL-like language -- the Hive Query Language (HQL) -- that are converted to map/reduced tasks, which are then executed by the Hadoop framework. You're using Hadoop, but it feels like you're talking SQL to an RDBMS.

Employing Hadoop's distributed file system (HDFS) as data storage, Hive inherits all of Hadoop's fault tolerance, scalability, and adeptness with huge data sets. When you run Hive, you are deposited into a shell, within which you can execute Hive Data Definition Language (DDL) and HQL commands. A future version of Hive will include JDBC and ODBC drivers, at which time you will be able to create fully executable "Hive applications" in much the same way that you can write a Java database application for your favorite RDBMS. (The current version of Hive -- 0.3.0 -- does have limited support for JDBC, but can only dispatch queries and fetch results.)

To install Hive, you simply install Hadoop and add a couple of download and configuration steps. (To install Hadoop, the best tutorial I've found is on Michael Noll's blog.) Or if you'd rather just get straight to testing Hive without all the installation nonsense, you can download a VMware virtual machine image with Hadoop and Hive pre-installed. The virtual machine image is featured in an excellent Hive tutorial video available at the same Web site.

BOTTOM LINE

Apache Hive is a specialized execution front end for Hadoop. Hive lets you write data queries in an SQL-like language -- the Hive Query Language (HQL) -- that are converted to map/reduced tasks, which are then executed by the Hadoop framework. You're using Hadoop, but it feels like you're talking SQL to an RDBMS.

Although Hive query language (HQL) commands are usually executed from within the Hive shell, you can launch the Hive Web Interface service and run HQL queries from within a browser. You can start multiple queries, and the Web interface will let you monitor the status of each.

I already had Hadoop running on an Ubuntu 8.10 system. To add Hive, I downloaded the gzip file from hadoop.apache.org/hive, and unpacked it into a folder next to the Hadoop home folder. Next, I defined a HIVE_HOME environment variable, and executed a few HDFS commands to create specific HDFS subdirectories that Hive requires. I launched the Hive shell and was ready to go. Total time was maybe 20 minutes. (This process is described in Hive's wiki, just off the Hive main Web page.)

HQL and SQL
Although Hive's principal goal is to provide an SQL-like query mechanism for Hadoop-based data, mimicry of SQL in such an environment can -- for a variety of reasons -- go only so far. First, HDFS was built for batchlike applications that pour large quantities of data into massive files that are subsequently processed by Hadoop map/reduce tasks. It is a write-once, read-often-and-sequentially file system. HDFS does not currently support random write operations and likely never will. Hence, HQL's closest approach to an SQL INSERT INTO command is INSERT OVERWRITE, which overwrites a table's existing content with new data. For example, suppose you have already created a Hive database table called TA, and you want to add new data to it from table TB. The HQL for this is:

INSERT OVERWRITE TA SELECT * FROM
(SELECT * FROM TA UNION
SELECT * FROM TB)

The new data is added by overwriting the old table with the concatenation of its original content and the data in TB.

In addition, Hive does not store database tables in a specialized file format. Instead, it causes ordinary HDFS files to "appear" to be database files. This illusion becomes apparent when you export data into a Hive table from a file stored in a standard Linux file system. No special conversion takes place; the file is copied byte for byte into Hive from its source image in the Linux directory. This means that you have to describe the structure of the file at the time you CREATE it as a Hive table.

For example, suppose I had converted the entire Encyclopedia Britannica into a single, linear text file and processed that to produce a data file consisting of word/offset pairs. For each line in the file, the first field is the text of a given word in the encyclopedia, and the second field is the large integer offset of the word's position in the text file. (So, the line "bob 1293" indicates that "bob" was the 1,293rd word in the encyclopedia.) Assuming the file's fields are separated by tab characters and the lines by line feeds, I could create a table for this file:

CREATE TABLE WORDLOC (theWord STRING, loc BIGINT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\009'
LINES TERMINATED BY '\012'
STORED AS TEXTFILE;

The structure of the file is explicitly described in the CREATE command. And when I imported the data into Hive, it would simply be copied directly, with no structural changes.

Nevertheless, Hive is impressive, particularly when you consider what is going on behind the scenes. It is converting HQL expressions into compiled-and-executed map/reduce tasks. In addition, the conversion is not a brute-force operation; Hive applies some intelligence. For example, Hive knows when conversion is unnecessary, so the simple expression "SELECT * FROM TA" will execute directly. Hive also performs "pipelining" of complex queries where possible. That is, if a query is resolved into a linear sequence of map/reduce tasks, the intermediate output of the first map/reduce job is passed on to the next job in the series, even before the first job is completed -- and so on down the line. This significantly improves throughput, as different stages in the pipeline are able to execute in parallel.

More HQL tricks
HQL is designed to be easily mastered by anyone already familiar with SQL. Though HQL is definitely a subset of SQL, it provides a surprising amount of SQL-like functionality. Hive's DDL includes commands for creating and dropping tables as well as altering table structure (adding or replacing columns). Tables can also be created with partition specifiers, which -- if strategically arranged -- can accelerate some queries. HQL's SELECT clause supports subqueries, as well as GROUP BY and SORT BY clauses. Also, you can perform multiple JOIN operations in an HQL query (though only the equality operator can be used in the JOIN conditional).

Other HQL language features have no direct SQL counterpart, but are understandable deviations when you consider HQL's raison d'etre. For example, if you already have a large table imported into Hive and want to test a query you've just written, but would rather not wait the hour you suspect the query will take, you can use Hive's TABLESAMPLE clause. Applied in conjunction with the CREATE command's CLUSTERED BY clause, adding the TABLESAMPLE clause to a query's FROM clause will involve only a subset of the entire table's data in that query, thereby reducing query execution time significantly.

[ Stay up to date on the latest open source developments with InfoWorld's Technology: Open Source newsletter. ]

Finally, if you want to add a new, user-defined function to HQL, Hive provides a plug-in mechanism whereby you can write your function (it will have to be in Java), compile it into a JAR file, and register it with the Hive infrastructure. Restart Hive, and your function is ready to use in your Hive queries.

Join the Hive
Hive is easy to install, and HQL is easy to pick up if you already know even a modest amount of SQL. And Hive has a bright future; the road map of upcoming features includes more support for languages other than Java, a HAVING clause, improvements to Hive's JOIN capabilities, additional data types, indexes, and much more.

Hive, however, is not a replacement for an RDBMS. As already mentioned, Hive does not support random row insertion or deletion. The Hive Web site makes it clear that Hive is a tool for the analysis and summarization of large datasets; it is not meant for structured, randomly accessed content storage.

Hadoop is emerging as the current darling of the cloud computing crowd, and Hive certainly assists that ascent. Creating Hadoop map/reduce tasks demands programming skills that Hive does not require (though some map/reduce jobs will always necessitate hand-coding). Still, Hive is an ideal express-entry into the large-scale distributed data processing world of Hadoop. All the ease of SQL with all the power of Hadoop -- sounds good to me.

Thanks to Facebook engineers Joydeep Sen Sarma and Ashish Thusoo for their assistance with this article.

VIA: INFOWORLD

Loading mentions Retweet
Filed under  //  cloud computing   HIVE   social networking   sql   technology  
Comments (0)
Posted 4 months ago