Creating an Atom feed in PHP

Brian M. Carey, Information Systems Consultant, Triangle Information Solutions


Summary:  Atom is an XML specification that identifies information contained in a Web site. Using Atom, Web developers produce feeds that enable other Web developers (or consumers who use feed readers) to quickly locate and view information of interest on a remote site. Think of it as a Web site's index, available to anyone who wants it. Using PHP, a popular language of choice for most host providers, a Web developer can easily produce an Atom feed that can then be made available to the various feed readers and other Web developers. The ultimate result is a state-of-the-art information solution that enables the Web content to reach a much wider audience.

 

What is Atom?

Atom, as it is used here, refers to an XML language that enables Web publishers to syndicate the content of their Web sites to various consumers. Using Atom, publishers are able to create a Web feed in a standardized format. This feed enables users to read the contents of the Web site with software known as a feed reader. It also enables other Web developers to publish the contents of the feed on their own Web sites.

Atom is by no means the only syndication standard in use today. RSS is another standardized format (also using XML) and predates Atom. In fact, Atom was created in response to certain limitations in RSS.

As a result, the Atom specification contains numerous advantages over RSS. Atom provides a means to define the format of the data being provided—for example, HTML, XHTML, and so on—whereas RSS does not. Atom, unlike RSS, supports internationalization with the xml:lang attribute. Atom also accepts more state-of-the-art (and standardized) date formatting, relying on Request for Comments (RFC) 3339 as opposed to RSS's RFC 822.

Why PHP with Atom?

PHP stands for PHP: Hypertext Processor. It might be the only acronym in the English language that, when expanded, still contains the original acronym. The historical significance here is that PHP originally stood for Personal Home Page.

PHP is a scripting language that produces dynamic, server-side content. It works harmoniously with HTML, and PHP code is frequently embedded within standard HTML Web pages to facilitate dynamic content.

PHP also works extremely well with MySQL, a database management system. Over the years of Web development, these two technologies have evolved together and worked side by side on countless occasions. This is almost certainly due to one undeniable, overarching rationale: They are both free.

In answer to the question at the top of this section, PHP gives developers the flexibility of producing dynamic content in an easy-to-read and easy-to-develop manner. The dynamic content is retrieved from a MySQL database. The output page (the feed) is coded using PHP so that it renders an XML output that conforms to the Atom specification.

Note that the explanations in this article were written assuming that you are familiar with the basics of MySQL and PHP. If you are not, see the links to introductory tutorials in the Resources section of this article.

Defining the business use case: Fishing reports

Your boss is in your office. He really likes the way that the company's Web site (fishinhole.com) is operating. The site currently markets and sells fishing tackle of all types to enthusiastic sport fishermen. The site also provides a forum for fishing reports, wherein said enthusiastic sport fishermen share their fish tales.

Your boss takes a seat (without asking) in a chair in your office and complains that the Web site isn't getting enough broad exposure. He wants to use the fishing reports section of the Web page to lure (pun intended) more enthusiastic sport fishermen to the Web site. He tells you that he wants you to make that section of the Web site a "one-stop shop" for sport fishing reports worldwide. This is essential to your ongoing success with fishinhole.com (or so he says). Your boss slurps his coffee, smiles, and walks out of your office with nothing else to say.

You lean back in your chair and get to thinking: What could give the fishing reports forum broader exposure? A moment later it comes to you: syndication! Instead of simply making the reports section available to users and shoppers at fishinhole.com, you can syndicate the forum so that people can read synopses of the fishing reports with their feed readers. Other Web developers might also include the syndication feed in their own Web pages. In either case, people would click on report titles of interest and be linked back to fishinhole.com, where you can expose them to a barrage of fishing tackle direct marketing. It's a great idea.

The database design

Long before your boss walked into your office, the database for the fishing reports forum was already designed. Recall that the fishing reports section of the Web page already exists. It simply hasn't been syndicated yet.

So what changes to the database do you need to make to syndicate its contents? None! That's one of the great things about syndication. In most cases, you can syndicate articles without changing your underlying schema or data model. This is because in most cases, you will syndicate articles, and articles almost always have the information that is required by the Atom specification.

Listing 1 shows the database model currently used by the fishing reports section of fishinhole.com. It also contains someINSERTs so that you have test data.

 

Listing 1. REPORTS table structure with INSERTs



CREATE TABLE IF NOT EXISTS `reports` (
  `ID` bigint(20) NOT NULL auto_increment,
  `AUTHOR` varchar(32) NOT NULL,
  `TITLE` varchar(64) NOT NULL,
  `SUBTITLE` varchar(128) NOT NULL,
  `CONTENT` varchar(2000) NOT NULL,
  `POSTED` datetime NOT NULL,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 AUTO_INCREMENT=5 ;


INSERT INTO `reports` (`id`, `author`, `title`, `subtitle`, `content`, `posted`) VALUES
(1, 'BigRed', 'Spanish Bite Looking Good!', 'Near the Cape!', 
'Trolled for 3 hours and limited out on Spanish Macks!  Watch out for the shallows 
near green can #4.', '2009-05-03 04:54:33'),
(2, 'JonBoy', 'Big Rock Report', 'Spring has sprung', 
'Caught several blackfins and mahi just outside of the Big Rock on Saturday.  
We were using flourescent squid teasers with ballyhoo for hookups.  
One Mahi weighed over 50#!', '2009-05-03 04:56:06'),
(3, 'Erasmus', 'Drum in the backwaters', 'The bite was hot!', 
'Loaded up against the marsh grass, boys.  Go get em.  I was using gulp 
with 1/4 ounce jigheads.', '2009-05-03 04:57:19'),
(4, 'ReelHooked', 'Speckled Trout In Old River', 'Limited out by noon', 
'They were schooling heavy in Old River.  They would eat anything we would 
throw at them.  Most were undersized, but we managed to keep some 
and had our fill by midday.', '2009-05-03 04:59:00');


If you are going to actually test the code presented in this article, you can do so by creating a MySQL database called fishinhole and executing the code from Listing 1 in that database.

The first column (ID) is the primary key of the table. Note that it uses the auto_increment specification so that every time a new row is inserted into the table, the ID column is populated with an increment of the previous row's ID column. This is similar to a sequence in an Oracle table.

The AUTHOR column simply specifies the user name of the person who posted the fishing report. This is the user's screen name, as opposed to the user's real first and last name (unless the user's real name is, in fact, the screen name).

The TITLE column is simply the title of the article. Likewise, the SUBTITLE is the subtitle of the article and is used in the Atom feed for the article synopsis.

The CONTENT column is the actual fishing report itself. Because the Atom feed produced here only includes a synopsis of the overall article (thus encouraging users to click the link and access the Web site), the content itself is not displayed in the Atom feed.

Finally, the POSTED column is a DATETIME column that stores the date that the article was posted on the Web site.

To keep things simple, I provide only a few articles (4). In a real-world situation, there would be thousands of these articles from hundreds of different authors.

The work commences

Now that you have the database design in place, it's time to code the PHP page so it produces an Atom feed. This article will walk you through the basics of creating a simple Atom feed, which you can test using PHP.

Please note that if you want to test the code you need access to a PHP processor. Most hosting solutions provide such access. You may also have access to one locally. Consult with the necessary system administrators or technical support staff to find out how you can execute PHP documents in your Web environment.

Accessing the database in PHP

Create a new PHP file called syndication.php. That's where you'll put your code.

As mentioned previously, PHP and MySQL have a rich history of working extremely well together. Some might go so far as to say they are married. But such judgments are not in the scope of this article.

Listing 2 provides a basic code snippet that enables you to access the MySQL database created with the code from Listing 1.


Listing 2. Accessing the MySQL database in PHP

$link = mysql_connect('localhost', 'admin', 'password')
    or die('Could not connect: ' . mysql_error());

mysql_select_db('fishinhole') or die('Could not select database');

$query = 'SELECT id,title,subtitle,author,posted 
     FROM reports order by posted desc limit 25';
;
$result = mysql_query($query) 
     or die('Query failed: ' . mysql_error());


That code actually covers quite a bit, so it's important to go over it step by step.

First comes the mysql_connect() function. You need to change the parameters according to the specifications of your own environment. The first parameter is the database host. In some cases, it will be just like Listing 2 (that is, localhost). In other cases it will be a remote host (for example, IP Address 10.92.2.1). It might also be an actual host name, assuming that you have Domain Name System (DNS) revolution (for example, mysql.myhost.com).

The second parameter is the name of the MySQL user who will access the database. You can use admin, as shown, only if that is a valid account within your own MySQL environment. Refer to the MySQL documentation to learn how to create accounts for a MySQL database. Keep in mind that the account used here must have read rights to the REPORTS database.

The third parameter is the user's password. This needs to match the password used for the user identified in the second parameter.

The or die clause is included to provide the developer with diagnostic information in the event of a failure. If a connection cannot be made to the database management system, you receive a message indicating that the connection failed and the reason it failed when the PHP script is executed.

Next comes the mysql_select_db() function. This is where you actually select which database you plan to use. Your own copy of MySQL can (and likely does) contain many databases, so it's important to specify which one you want to use.

Recall that I recommended you create a MySQL database called fishinhole if you want to test the functionality of the code provided in this article. The mysql_select_db() line specifies that you will use that database.

Next comes the actual query. In this case, you just define the query in a string. Here you grab the IDTITLESUBTITLEAUTHOR, and POSTED columns from the REPORTS table. The order by posted desc clause forces the query to return rows in descending order by the date in the POSTED column (which is the date that the article was posted on the Web site). So, you retrieve the most recent articles first. This is a standard practice for feeds.

The limit 25 clause at the end is important. This is where you specify that you want a maximum of 25 articles returned for this feed. Recall I mentioned earlier that forums such as this one can have thousands of articles. It is simply not practical to return thousands of articles in a feed. The bandwidth used is significant, and most consumers end up waiting for awhile.

This query is a string. It is assigned to a variable intuitively named $query.

In the mysql_query() function, you actually execute the query defined in the previous line. The results of that query are stored in the $result variable. Once again, the or die clause is in place for diagnostic purposes.

The loop and the Atom specification

Now that you have the data from the database, it's time to start displaying it in a format that conforms to the Atom specification. Because Atom is an XML language, the output of the PHP file is in XML format, as opposed to HTML format. If you intend to use a Web browser to display the output, just keep in mind that it will display differently depending upon your browser and version. To view the XML output, it's usually best to right-click on the output in a browser and select View Source. Then you will see the raw XML output.

Before displaying information about each article, it's important to include the preamble to the Atom feed. This is the section that identifies the output as an Atom feed and provides pertinent information about the feed, as shown in Listing 3.


Listing 3. The Atom preamble


<feed xml:lang="en-US" xmlns="http://www.w3.org/2005/Atom"> 
     <title>Fishing Reports</title> 
     <subtitle>The latest reports from fishinhole.com</subtitle>
     <link href="http://www.fishinhole.com/reports/syndication.php" rel="self"/> 
     <updated><?php echo date3339(); ?></updated>
     <author> 
          <name>NameOfYourBoss</name>
          <email>nameofyourboss@fishinhole.com</email>
     </author>
     <id>
     tag:fishinhole.com,2008:http://www.fishinhole.com/reports/syndication.php
     </id> 



You might immediately notice that the code in Listing 3 doesn't look like PHP. That's because most of it isn't. It's standardized output that requires little in the way of dynamic content.

The <feed> element identifies this XML document as an Atom feed. The namespace used to define the elements is provided as an attribute of the <feed> element. You also use the aforementioned xml:lang attribute to specify that this is a document written in English.

The <title> element specifies a title for the overall feed. Likewise, the <subtitle> element specifies a subtitle for the overall feed.

The <link> element specifies the URL of this syndication.php document. The address in the example works in the fictitious world that is described in this article, but in real life it does not. In reality, you can include a link that produces the output of this feed.

The <updated> element produces a timestamp (compliant with the RFC 3339 standard) that tells the consumer of this feed when it was last updated. In this case, since the feed will always be up to date because it retrieves the latest data from the database, you use the current timestamp. And you may notice that there is a little snippet of PHP code in this element. That is a custom-built PHP function that produces a timestamp in RFC 3339 format.

The <author> element defines the author of the overall feed. You'll be using your boss's name as the author because it was his idea.

Finally, the <id> element uniquely identifies the feed in an Internationalized Resource Identifier (IRI) format.

Listing 4 is the main loop that produces each entry in the Atom feed. The vast majority of the work for producing the feed is done here.

Listing 4. The loop

<?php
     $i = 0;
     while($row = mysql_fetch_array($result))
       {
          if ($i > 0) {
               echo "</entry>";
           }

           $articleDate = $row['posted'];
           $articleDateRfc3339 = date3339(strtotime($articleDate));
           echo "<entry>";
           echo "<title>";
           echo $row['title'];
           echo "</title>";
           echo "<link type='text/html' 
                    href='http://www.fishinhole.com/reports/report.php?
                    id=".$row['id']."'/>";
           echo "<id>";
           echo "tag:fishinhole.com,2008:http:
                    //www.fishinhole.com/reports/report.php?id=".$row['id'];
           echo "</id>";
           echo "<updated>";
           echo $articleDateRfc3339;
           echo "</updated>";
           echo "<author>";
           echo "<name>";
           echo $row['author'];
           echo "</name>";
           echo "</author>"; 
           echo "<summary>";
           echo $row['subtitle'];
           echo "</summary>";

           $i++;
     }			
?>


Once again, Listing 4 covers quite a bit of ground. First, is the while loop. Basically, this part of the code says, in English, "as long as there are rows in the table that haven't been included in the output yet, keep going." The current row in each iteration is stored in a PHP variable intuitively called $row.

Then the counter ($i) is checked. If the counter is more than 0, then that means this is at least the second iteration. In that case, it is necessary to close the previous iteration's <entry> element.

The next two lines retrieve the article date (from the POSTED column) and convert it to RFC 3339 format using the aforementioned function.

Next, the <entry> element is started. Following that is the <title> element, which is populated from the TITLE column in the current row.

The <link> element is unusual in that it doesn't contain any child text. Instead, the actual link is referenced as an attribute. This is part of the Atom standard. The link simply points the user to the URL where the user can read the entire article. Recall that this feed provides only a synopsis to the user.

The <id> element is similar to the one that was described previously. It uniquely identifies this element in IRI format. And, as before, it is constructed from the relevant URL.

The <updated> element contains the DATETIME value (in RFC 3339 format) from the POSTED column. Recall that the$articleDateRfc3339 variable for this document was populated earlier in this iteration.

Next comes the <author> element. This element, unlike the others (but like the <author> element in the preamble) has child elements. For this article, only one of those children is used: the author's name. The author's name is populated from theAUTHOR column of the current row.

The <summary> element contains the information gleaned from the SUBTITLE column of the current row.

Finally, the loop counter ($i) is incremented, and the loop continues.

That, in a nutshell, is the entire body of code associated with producing an Atom document from the REPORTS table. As you can see, it's not as complicated as it might seem at first.

Also, keep in mind that many elements in the Atom specification are not covered here. You can just as easily add those by following the same patterns I describe in this section of the code. For more information, see Resources.

Test it!

Now comes the fun part: testing!

Rather than retype (or copy and paste) everything you see in the code listings above, you can simply use the PHP file that is included in the Download section. Copy that file to a local directory and make the necessary database changes that I described earlier (user name, password, and host). Then copy it to a PHP file structure that has access to the database.

When you have the PHP file in the correct place, launch your browser and access your file as follows: http://your host/context/syndication.php.

As with any customized solution, you need to change the values in italics to match your specific environment.

As I stated previously, your results will vary depending upon which browser and version you use. Some of the more modern browsers detect that this is an Atom feed and display the results accordingly. Others display it in raw XML format. Still others might produce nothing because the document is not a standard HTML document.

If the browser does not display the raw XML, you can do so simply by right-clicking on the document and selecting View Source. After you do that, you should see something similar to Listing 5.


Listing 5. The output (abbreviated)


<?xml version='1.0' encoding='iso-8859-1' ?>
<feed xml:lang="en-US" xmlns="http://www.w3.org/2005/Atom">
  <title>Fishing Reports</title>
  <subtitle>The latest reports from fishinhole.com</subtitle>
  <link href="http://www.fishinhole.com/reports" rel="self"/>
  <updated>2009-05-03T16:19:54-05:00</updated>
  <author>
   <name>NameOfYourBoss</name>
   <email>nameofyourboss@fishinhole.com</email>
  </author>
  <id>tag:fishinhole.com,2008:http://www.fishinhole.com/reports</id>
  <entry>
   <title>Speckled Trout In Old River</title>
   <link type='text/html' href='http://www.fishinhole.com/reports/report.php?id=4'/>
   <id>tag:fishinhole.com,2008:http://www.fishinhole.com/reports/report.php?id=4</id>
   <updated>2009-05-03T04:59:00-05:00</updated>
   <author>
    <name>ReelHooked</name>
   </author>
   <summary>Limited out by noon</summary>
  </entry>
...
</feed>


Another way to test it is to verify that the feed is valid. You can do that using one of the many Atom feed validators you can find in cyberspace. A good one to use is http://www.feedvalidator.org. That Web site validates feeds in Atom, RSS, and Keyhole Markup Language (KML) formats.

Business Results

Because you implement and deploy your Atom feed, thousands of new enthusiastic sport fishermen from around the world now have exposure to the fishing reports on your Web site. You are getting hundreds of incoming links from sport fishing sites that are embedding your Atom feed. Some enthusiastic sport fishermen are even using feed readers to view the reports on a daily basis.

Your boss pops back into your office after looking at the latest traffic reports. He is pleased with the additional visits and reports that unique visitors have increased by 10%. He gives you a thumbs up, slurps his coffee, and walks away.

Conclusion

The Atom specification is an ideal means of syndicating your Web content. Using PHP with MySQL, you can easily produce a Web feed that complies with the Atom standard and is always up to date because it reads directly from the database. The feed can then be read by a feed reader or embedded in other Web sites. The end result is broader exposure for your Web content, and that means more visitors and, most likely, an increase to your bottom line.

 

Loading mentions Retweet
Filed under  //  IBM   pgm   sql   website  
Comments (0)
Posted 3 months ago

HIVE

MY PREVIOUS ARTICLE ABOUT HIVE:

http://swathidharshananaidu.posterous.com/open-source-hive-large-scale-distributed-data

GET more from http://wiki.apache.org/hadoop/Hive

What is Hive

 Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, adhoc querying and analysis of large datasets data stored in Hadoop files. It provides a mechanism to put structure on this data and it also provides a simple query language called QL which is based on SQL and which enables users familiar with SQL to query this data. At the same time, this language also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis which may not be supported by the built in capabilities of the language.

Hive does not mandate read or written data be in "hive format" - there is no such thing; Hive works equally well on Thrift, control delimited, or your data format. Please see File Format and SerDe in Developer Guide for details.

What Hive is NOT

Hive is based on Hadoop which is a batch processing system. Accordingly, this system does not and cannot promise low latencies on queries. The paradigm here is strictly of submitting jobs and being notified when the jobs are completed as opposed to real time queries. As a result it should not be compared with systems like Oracle where analysis is done on a significantly smaller amount of data but the analysis proceeds much more iteratively with the response times between iterations being less than a few minutes. For Hive queries response times for even the smallest jobs can be of the order of 5-10 minutes and for larger jobs this may even run into hours.

If your input data is small you can execute a query in a short time. For example, if a table has 100 rows you can 'set mapred.reduce.tasks=1' and 'set mapred.map.tasks=1' and the query time will be ~15 seconds.

 

Table of Contents

  1. Hive introduction videos From Cloudera
  2. Preparations
    1. Requirements
    2. Downloading and building
    3. Running Hive
    4. Configuration management overview
    5. Error Logs
  3. DDL Operations
    1. Metadata Store
  4. DML Operations
  5. SQL Operations
    1. Runtime configuration
    2. Example Queries
      1. SELECTS and FILTERS
      2. GROUP BY
      3. JOIN
      4. MULTITABLE INSERT
      5. STREAMING

 

DISCLAIMER: This is a prototype version of Hive and is NOT production quality. However, we are working hard to make Hive a production quality system. Hive has only been tested on unix(linux) and mac systems using Java 1.6 for now - although it may very well work on other similar platforms. It does not work on Cygwin right now. Most of our testing has been on Hadoop 0.17.2 - so we would advise running it against this version of hadoop - even though it may compile/work against other versions

Hive introduction videos From Cloudera

www.png" height="11" alt="[WWW]" style="" width="11" /> Hive Introduction Video

www.png" height="11" alt="[WWW]" style="" width="11" /> Hive Tutorial Video

Preparations

Requirements

  • Java 1.6

  • Hadoop 0.17.x to 0.19.x. Support of Hadoop 0.20.x is in progress: [www.png" height="11" alt="[WWW]" style="" width="11" /> HIVE-487]

Downloading and building

Hive is available via SVN at: www.png" height="11" alt="[WWW]" style="" width="11" /> http://svn.apache.org/repos/asf/hadoop/hive/trunk

  $ svn co http://svn.apache.org/repos/asf/hadoop/hive/trunk hive
  $ cd hive
  $ ant -Dhadoop.version="<your-hadoop-version>" package
  # For example
  $ ant -Dhadoop.version="0.17.2" package
  $ cd build/dist
  $ ls
  README.txt
  bin/ (all the shell scripts)
  lib/ (required jar files)
  conf/ (configuration files)
  examples/ (sample input and query files)

In the rest of the page, we use build/dist and <install-dir> interchangeably.

Running Hive

Hive uses hadoop that means:

  • you must have hadoop in your path OR

  • export HADOOP_HOME=<hadoop-install-dir>

In addition, you must create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and set them chmod g+w in HDFS before a table can be created in Hive.

Commands to perform this setup

  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /tmp
  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /user/hive/warehouse
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse

I also find it useful but not necessary to set HIVE_HOME

  $ export HIVE_HOME=<hive-install-dir>

To use hive command line interface (cli) from the shell:

  $ $HIVE_HOME/bin/hive

Configuration management overview

- hive default configuration is stored in <install-dir>/conf/hive-default.xml

  • Configuration variables can be changed by (re-)defining them in <install-dir>/conf/hive-site.xml

- log4j configuration is stored in <install-dir>/conf/hive-log4j.properties

- hive configuration is an overlay on top of hadoop - meaning the hadoop configuration variables are inherited by default.

- hive configuration can be manipulated by:

  • editing hive-site.xml and defining any desired variables (including hadoop variables) in it

  • from the cli using the set command (see below)

  • by invoking hive using the syntax:

    • $ bin/hive -hiveconf x1=y1 -hiveconf x2=y2

      • this sets the variables x1 and x2 to y1 and y2 respectively

Error Logs

Hive uses log4j for logging. By default logs are not emitted to the console by the cli. They are stored in the file: - /tmp/{user.name}/hive.log

If the user wishes - the logs can be emitted to the console by adding the arguments shown below: - bin/hive -hiveconf hive.root.logger=INFO,console

Note that setting hive.root.logger via the 'set' command does not change logging properties since they are determined at initialization time.

Error logs are very useful to debug problems. Please send them with any bugs (of which there are many!) to [MAILTO] hive-dev@hadoop.apache.org.

DDL Operations

Creating Hive tables and browsing through them

  hive> CREATE TABLE pokes (foo INT, bar STRING);  

Creates a table called pokes with two columns, the first being an integer and the other a string

  hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);  

Creates a table called invites with two columns and a partition column called ds. The partition column is a virtual column. It is not part of the data itself but is derived from the partition that a particular dataset is loaded into.

By default, tables are assumed to be of text input format and the delimiters are assumed to be ^A(ctrl-a).

  hive> SHOW TABLES;

lists all the tables

  hive> SHOW TABLES '.*s';

lists all the table that end with 's'. The pattern matching follows Java regular expressions. Check out this link for documentation [WWW]http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html

hive> DESCRIBE invites;

shows the list of columns

As for altering tables, table names can be changed and additional columns can be dropped:

  hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
  hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
  hive> ALTER TABLE events RENAME TO 3koobecaf;

Dropping tables:

  hive> DROP TABLE pokes;

Metadata Store

Metadata is in an embedded Derby database whose disk storage location is determined by the hive configuration variable named javax.jdo.option.ConnectionURL. By default (see conf/hive-default.xml), this location is ./metastore_db

Right now, in the default configuration, this metadata can only be seen by one user at a time.

Metastore can be stored in any database that is supported by JPOX. The location and the type of the RDBMS can be controlled by the two variables 'javax.jdo.option.ConnectionURL' and 'javax.jdo.option.ConnectionDriverName'. Refer to JDO (or JPOX) documentation for more details on supported databases. The database schema is defined in JDO metadata annotations file package.jdo at src/contrib/hive/metastore/src/model.

In the future, the metastore itself can be a standalone server.

If you want to run the metastore as a network server so it can be accessed from multiple nodes try HiveDerbyServerMode.

DML Operations

Loading data from flat files into Hive:

  hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes; 

Loads a file that contains two columns separated by ctrl-a into pokes table. 'local' signifies that the input file is on the local file system. If 'local' is omitted then it looks for the file in HDFS.

The keyword 'overwrite' signifies that existing data in the table is deleted. If the 'overwrite' keyword is omitted, data files are appended to existing data sets.

NOTES:

  • NO verification of data against the schema is performed by the load command.

  • If the file is in hdfs, it is moved into the Hive-controlled file system namespace. The root of the Hive directory is specified by the option 'hive.metastore.warehouse.dir' in hive-default.xml. We advise users to create this directory before trying to create tables via Hive.

  hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
  hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');

The two LOAD statements above load data into two different partitions of the table invites. Table invites must be created as partitioned by the key ds for this to succeed.

  hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

The above command will load data from an HDFS file/directory to the table. Note that loading data from HDFS will result in moving the file/directory. As a result, the operation is almost instantaneous.

SQL Operations

Runtime configuration

  • Hive queries are executed using map-reduce queries and, therefore, the behavior of such queries can be controlled by the hadoop configuration variables.

  • The cli command 'SET' can be used to set any hadoop (or hive) configuration variable. For example:

    hive> SET mapred.job.tracker=myhost.mycompany.com:50030
    hive> SET -v 
  • The latter shows all the current settings. Without the -v option only the variables that differ from the base hadoop configuration are displayed

  • In particular, the number of reducers should be set to a reasonable number to get good performance (the default is 1!)

Example Queries

Some example queries are shown below. They are available in build/dist/examples/queries. More are available in the hive sources at ql/src/test/queries/positive

SELECTS and FILTERS
  hive> SELECT a.foo FROM invites a WHERE a.ds='<DATE>';

selects column 'foo' from all rows of partition <DATE> of invites table. The results are not stored anywhere, but are displayed on the console.

Note that in all the examples that follow, INSERT (into a hive table, local directory or HDFS directory) is optional.

  hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='<DATE>';

selects all rows from partition <DATE> OF invites table into an HDFS directory. The result data is in files (depending on the number of mappers) in that directory. NOTE: partition columns if any are selected by the use of *. They can also be specified in the projection clauses.

Partitioned tables must always have a partition selected in the WHERE clause of the statement.

  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;

Selects all rows from pokes table into a local directory

  hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
  hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100; 
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' select a.invites, a.pokes FROM profiles a;
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a WHERE a.ds='<DATE>';
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;

Sum of a column. avg, min, max can also be used

GROUP BY
  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;
  hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
JOIN
  hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
MULTITABLE INSERT
  FROM src
  INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
  INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
  INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
  INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
STREAMING
  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';

This streams the data in the map phase through the script /bin/cat (like hadoop streaming). Similarly - streaming can be used on the reduce side (please see the Hive Tutorial or examples)

last edited 2009-07-24 08:28:25 by ZhengShao

 

Loading mentions Retweet
Filed under  //  cloud computing   HIVE   sql  
Comments (0)
Posted 4 months ago

Open source Hive: Large-scale, distributed data processing made easy

Thank heaven for Hive, a data analysis and query front end for Hadoop that makes Hadoop data files look like SQL tables

Suppose you want to run regular statistical analyses on your Web site's traffic log data -- several hundred terabytes, updated weekly. (Don't laugh. This is not unheard of for popular Web sites.) You're already familiar with Hadoop (see InfoWorld's review), the open source distributed processing system that would be ideal for this task. But you don't have time to code Hadoop map/reduce functions? Perhaps you're not the elite programmer that everyone in the office thinks you are.

What you'd like to do is dump all that information into a database, and execute a set of SQL queries on it. But the quantity of data would overwhelm even an enterprise-level RDBMS.

[ Read the InfoWorld Test Center's hands-on account of working with Amazon Elastic MapReduce and Amazon Web Services. | Keep abreast of cloud computing news by visiting InfoWorld's Cloud Computing channel. ]

This is precisely the problem that engineers at Facebook encountered. They became interested in Hadoop as a means of processing their Web site's traffic data that was generating terabytes per day, was growing, and was overtaxing their Oracle database. Though they were happy with Hadoop, they wanted to simplify its use so that engineers could express frequently used analysis operations in SQL. The resulting Hadoop-based data warehouse application became Hive, and it helps to process more than 10TB of Facebook data daily. Now Hive is available as an open source subproject of Apache Hadoop.

Inside the Hive
Written in Java, Hive is a specialized execution front end for Hadoop. Hive lets you write data queries in an SQL-like language -- the Hive Query Language (HQL) -- that are converted to map/reduced tasks, which are then executed by the Hadoop framework. You're using Hadoop, but it feels like you're talking SQL to an RDBMS.

Employing Hadoop's distributed file system (HDFS) as data storage, Hive inherits all of Hadoop's fault tolerance, scalability, and adeptness with huge data sets. When you run Hive, you are deposited into a shell, within which you can execute Hive Data Definition Language (DDL) and HQL commands. A future version of Hive will include JDBC and ODBC drivers, at which time you will be able to create fully executable "Hive applications" in much the same way that you can write a Java database application for your favorite RDBMS. (The current version of Hive -- 0.3.0 -- does have limited support for JDBC, but can only dispatch queries and fetch results.)

To install Hive, you simply install Hadoop and add a couple of download and configuration steps. (To install Hadoop, the best tutorial I've found is on Michael Noll's blog.) Or if you'd rather just get straight to testing Hive without all the installation nonsense, you can download a VMware virtual machine image with Hadoop and Hive pre-installed. The virtual machine image is featured in an excellent Hive tutorial video available at the same Web site.

BOTTOM LINE

Apache Hive is a specialized execution front end for Hadoop. Hive lets you write data queries in an SQL-like language -- the Hive Query Language (HQL) -- that are converted to map/reduced tasks, which are then executed by the Hadoop framework. You're using Hadoop, but it feels like you're talking SQL to an RDBMS.

Although Hive query language (HQL) commands are usually executed from within the Hive shell, you can launch the Hive Web Interface service and run HQL queries from within a browser. You can start multiple queries, and the Web interface will let you monitor the status of each.

I already had Hadoop running on an Ubuntu 8.10 system. To add Hive, I downloaded the gzip file from hadoop.apache.org/hive, and unpacked it into a folder next to the Hadoop home folder. Next, I defined a HIVE_HOME environment variable, and executed a few HDFS commands to create specific HDFS subdirectories that Hive requires. I launched the Hive shell and was ready to go. Total time was maybe 20 minutes. (This process is described in Hive's wiki, just off the Hive main Web page.)

HQL and SQL
Although Hive's principal goal is to provide an SQL-like query mechanism for Hadoop-based data, mimicry of SQL in such an environment can -- for a variety of reasons -- go only so far. First, HDFS was built for batchlike applications that pour large quantities of data into massive files that are subsequently processed by Hadoop map/reduce tasks. It is a write-once, read-often-and-sequentially file system. HDFS does not currently support random write operations and likely never will. Hence, HQL's closest approach to an SQL INSERT INTO command is INSERT OVERWRITE, which overwrites a table's existing content with new data. For example, suppose you have already created a Hive database table called TA, and you want to add new data to it from table TB. The HQL for this is:

INSERT OVERWRITE TA SELECT * FROM
(SELECT * FROM TA UNION
SELECT * FROM TB)

The new data is added by overwriting the old table with the concatenation of its original content and the data in TB.

In addition, Hive does not store database tables in a specialized file format. Instead, it causes ordinary HDFS files to "appear" to be database files. This illusion becomes apparent when you export data into a Hive table from a file stored in a standard Linux file system. No special conversion takes place; the file is copied byte for byte into Hive from its source image in the Linux directory. This means that you have to describe the structure of the file at the time you CREATE it as a Hive table.

For example, suppose I had converted the entire Encyclopedia Britannica into a single, linear text file and processed that to produce a data file consisting of word/offset pairs. For each line in the file, the first field is the text of a given word in the encyclopedia, and the second field is the large integer offset of the word's position in the text file. (So, the line "bob 1293" indicates that "bob" was the 1,293rd word in the encyclopedia.) Assuming the file's fields are separated by tab characters and the lines by line feeds, I could create a table for this file:

CREATE TABLE WORDLOC (theWord STRING, loc BIGINT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\009'
LINES TERMINATED BY '\012'
STORED AS TEXTFILE;

The structure of the file is explicitly described in the CREATE command. And when I imported the data into Hive, it would simply be copied directly, with no structural changes.

Nevertheless, Hive is impressive, particularly when you consider what is going on behind the scenes. It is converting HQL expressions into compiled-and-executed map/reduce tasks. In addition, the conversion is not a brute-force operation; Hive applies some intelligence. For example, Hive knows when conversion is unnecessary, so the simple expression "SELECT * FROM TA" will execute directly. Hive also performs "pipelining" of complex queries where possible. That is, if a query is resolved into a linear sequence of map/reduce tasks, the intermediate output of the first map/reduce job is passed on to the next job in the series, even before the first job is completed -- and so on down the line. This significantly improves throughput, as different stages in the pipeline are able to execute in parallel.

More HQL tricks
HQL is designed to be easily mastered by anyone already familiar with SQL. Though HQL is definitely a subset of SQL, it provides a surprising amount of SQL-like functionality. Hive's DDL includes commands for creating and dropping tables as well as altering table structure (adding or replacing columns). Tables can also be created with partition specifiers, which -- if strategically arranged -- can accelerate some queries. HQL's SELECT clause supports subqueries, as well as GROUP BY and SORT BY clauses. Also, you can perform multiple JOIN operations in an HQL query (though only the equality operator can be used in the JOIN conditional).

Other HQL language features have no direct SQL counterpart, but are understandable deviations when you consider HQL's raison d'etre. For example, if you already have a large table imported into Hive and want to test a query you've just written, but would rather not wait the hour you suspect the query will take, you can use Hive's TABLESAMPLE clause. Applied in conjunction with the CREATE command's CLUSTERED BY clause, adding the TABLESAMPLE clause to a query's FROM clause will involve only a subset of the entire table's data in that query, thereby reducing query execution time significantly.

[ Stay up to date on the latest open source developments with InfoWorld's Technology: Open Source newsletter. ]

Finally, if you want to add a new, user-defined function to HQL, Hive provides a plug-in mechanism whereby you can write your function (it will have to be in Java), compile it into a JAR file, and register it with the Hive infrastructure. Restart Hive, and your function is ready to use in your Hive queries.

Join the Hive
Hive is easy to install, and HQL is easy to pick up if you already know even a modest amount of SQL. And Hive has a bright future; the road map of upcoming features includes more support for languages other than Java, a HAVING clause, improvements to Hive's JOIN capabilities, additional data types, indexes, and much more.

Hive, however, is not a replacement for an RDBMS. As already mentioned, Hive does not support random row insertion or deletion. The Hive Web site makes it clear that Hive is a tool for the analysis and summarization of large datasets; it is not meant for structured, randomly accessed content storage.

Hadoop is emerging as the current darling of the cloud computing crowd, and Hive certainly assists that ascent. Creating Hadoop map/reduce tasks demands programming skills that Hive does not require (though some map/reduce jobs will always necessitate hand-coding). Still, Hive is an ideal express-entry into the large-scale distributed data processing world of Hadoop. All the ease of SQL with all the power of Hadoop -- sounds good to me.

Thanks to Facebook engineers Joydeep Sen Sarma and Ashish Thusoo for their assistance with this article.

VIA: INFOWORLD

Loading mentions Retweet
Filed under  //  cloud computing   HIVE   social networking   sql   technology  
Comments (0)
Posted 4 months ago