Friday, March 27, 2009

Unwritten guide to Yahoo Query Language

I recently had a chance to play around with the Yahoo Query Language (YQL) while redesigning my homepage, and I have found several uses for this versatile tool for AJAX developers. The three functions I found most useful are the JSON, FEED, and HTML data queries. I recommend that you open the Yahoo Query Language Console so you can run the queries I list and get a feel for what data is returned.


JSON

The first and most obvious usage is the JSON data query:


Query 1. JSON from Yahoo Pipes
select * from json where
url="http://pipes.yahoo.com/pipes/pipes.popular?_out=json"
and itemPath = "value.items"
and categories.module = "regex"

Although the above query is fairly simple, there are a few details worth looking at. The statement (as with all YQL statements) starts with the 'select' keyword. That is followed by '*', which designates that you want all data elements to be returned. If you wanted to limit the fields returned by the query, you would list those fields in place of the '*' symbol.

The datasource is listed after the 'from' keyword; in this case the datasource is json. The datasource is followed by the 'where' keyword, and this section of the query is where the important stuff happens. The where clause lists the requirements that the data must meet in order to be returned by YQL. JSON has two pre-defined fields to filter against in the 'where' clause:

url
-(Required) Set this equal to the url where the JSON document can be found.
itemPath
-(Optional) Set this to select a sub-set of the data. In the previous query, the contents of the 'items' field under the base 'value' field are selected as the base for the datasource. Think of it as the xpath for JSON.

In addition to these pre-defined fields you can also filter against any field in the datasource. In the example query (Query 1.) we require that 'categories' contain a 'module' field that equals 'regex'.
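To make the 'itemPath' and field-filter ideas concrete, here is a minimal local sketch in JavaScript. It is only an analogy for what the query asks YQL to do, not how YQL works internally, and YQL's real matching rules for nested or repeated fields may differ. The helper names (resolvePath, filterByField) and the sample data are hypothetical.

```javascript
// Local analogy only: this is not how YQL is implemented.
// Walk a dotted path such as "value.items" through a parsed JSON object.
function resolvePath(data, path) {
  return path.split(".").reduce(function (node, key) {
    return node == null ? undefined : node[key];
  }, data);
}

// Keep only the items whose (possibly nested) field equals the expected value.
function filterByField(items, fieldPath, expected) {
  return items.filter(function (item) {
    return resolvePath(item, fieldPath) === expected;
  });
}

// Hypothetical sample data, invented for illustration.
var sample = {
  value: {
    items: [
      { title: "Pipe A", categories: { module: "regex" } },
      { title: "Pipe B", categories: { module: "fetch" } }
    ]
  }
};

// Roughly: itemPath = "value.items" and categories.module = "regex"
var items = resolvePath(sample, "value.items");
var matches = filterByField(items, "categories.module", "regex");
console.log(matches.map(function (item) { return item.title; }));
```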

Note: Another good reason to use the JSON datasource is when you have to consume, via JavaScript, an API that doesn't support callbacks (JSONP). You can run the API through a YQL query to wrap it in a callback.
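As a rough sketch of that callback trick, the JavaScript below builds a request URL for a YQL query and asks for the result to be wrapped in a callback. The endpoint address and the 'format' and 'callback' parameter names are my assumptions about the public YQL web service and are not spelled out in this article. The example.com URL and the function names are placeholders.

```javascript
// Assumed public YQL endpoint; not given in the article.
var YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql";

// Build a request URL that asks YQL for JSON wrapped in a callback (JSONP).
// The "format" and "callback" parameter names are assumptions.
function buildYqlUrl(query, callbackName) {
  var params = [
    "q=" + encodeURIComponent(query),
    "format=json",
    "callback=" + encodeURIComponent(callbackName)
  ];
  return YQL_ENDPOINT + "?" + params.join("&");
}

// Browser-only helper: load the URL through a script tag so the callback fires.
function loadJsonp(url) {
  var script = document.createElement("script");
  script.src = url;
  document.getElementsByTagName("head")[0].appendChild(script);
}

// Placeholder API URL standing in for a service without JSONP support.
var query = 'select * from json where url="http://example.com/api.json"';
var url = buildYqlUrl(query, "handleResult");
console.log(url);
```

In a page you would define a global function named handleResult and then call loadJsonp(url); the response script would invoke that function with the query results.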

FEED

The FEED datasource is one that can be easy or hard depending on the kind of feed you will be consuming (RSS or Atom).


Query 2. FEED from Atom feed on http://jawtek.blogspot.com/
select title, content.content, published, link.href from feed where
url='http://jawtek.blogspot.com/feeds/posts/default'
and content.type = 'html'
and link.rel = 'alternate'

Query 3. FEED from RSS feed on http://jawtek.blogspot.com/
select title, description, pubDate, link from feed where
url='http://jawtek.blogspot.com/feeds/posts/default?alt=rss'

I'll spare you the basics this time and just focus on what makes FEED different from JSON. The first thing you probably noticed is that the Atom query (Query 2.) is longer than the RSS query (Query 3.). This has nothing to do with YQL and more to do with the differences between the two formats. Atom is more verbose and takes more filtering to get the data you want, while RSS, being the older of the two formats, contains less data. I'm not going to say one is better than the other, but I will say that Atom lets you be more specific about what data you collect, at the cost of some large queries. FEED, just like JSON, allows you to filter on any field defined in the feed, but it has only one pre-defined field:

url
-(Required) Set this equal to the url where the feed can be found.

Note: Because Atom and RSS give the same elements different names, your JavaScript will have to be prepared to deal with that.
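One way to deal with the naming difference is to map both kinds of entries onto a single shape before the rest of your script touches them. The sketch below assumes the result objects mirror the field names selected in Query 2. and Query 3.; the exact shape YQL returns may differ, so check it in the console first. The function names and sample entries are hypothetical.

```javascript
// Some feed elements may arrive as objects with a "content" property
// rather than plain strings; unwrap them when that happens.
function textOf(value) {
  if (value && typeof value === "object") {
    return value.content;
  }
  return value;
}

// Map an Atom-style or RSS-style entry onto one common shape.
function normalizeEntry(entry) {
  var link = entry.link;
  return {
    title: textOf(entry.title),
    body: textOf(entry.content !== undefined ? entry.content : entry.description),
    date: entry.published !== undefined ? entry.published : entry.pubDate,
    url: link && typeof link === "object" ? link.href : link
  };
}

// Hypothetical entries shaped after the fields selected in Query 2. and Query 3.
var atomEntry = {
  title: "Hello",
  content: { content: "<p>Post body</p>" },
  published: "2009-03-27T10:00:00Z",
  link: { href: "http://example.com/hello" }
};
var rssEntry = {
  title: "Hello",
  description: "<p>Post body</p>",
  pubDate: "Fri, 27 Mar 2009 10:00:00 +0000",
  link: "http://example.com/hello"
};

var fromAtom = normalizeEntry(atomEntry);
var fromRss = normalizeEntry(rssEntry);
console.log(fromAtom.url === fromRss.url);
```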

HTML

The final datasource I'll discuss is HTML. The HTML datasource is useful when the data you want is in an HTML document. You could use it to get data from a site that doesn't make its data available through JSON or a FEED.

Query 4. HTML query to scrape data from Yahoo Finance
select * from html where
url="http://finance.yahoo.com/q?s=yhoo"
and xpath='//div[@id="yfi_headlines"]/div[2]/ul/li/a'

What separates the HTML datasource from the previous two is that you do everything through pre-defined fields. It does not seem possible to filter on elements, at least not that I've found. This isn't an issue, since the query relies on xpath, which is very powerful. There are four pre-defined fields:

url
-(Required) The url of the HTML document you would like to query
xpath
-(Optional) The xpath to search the document with. The finer points of xpath are beyond this article, but the xpath in the query (Query 4.) says: starting anywhere ('//'), look for a 'div' with ('[]') the attribute ('@') 'id' set to the value 'yfi_headlines', then go down one level ('/') to the second ('[2]') 'div'. From there go down ('/') through each 'ul' to all 'li's to get all 'a's.
charset
-(Optional) This field defines the character set of the HTML document you are querying. A value is not required here, as YQL automatically determines the character set of the HTML document.
browser
-(Optional) I have not been able to determine the usage of this field. The console tells you that the only allowable value for it is 'me'.

Note: Although I list the xpath field as optional, I highly recommend against leaving it blank, because it defaults to '/html/body', which means everything in the body.
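If you build HTML queries from JavaScript, a small helper can enforce that advice by refusing to produce a query without an xpath. This is only a sketch: the function name is hypothetical, and the quoting check is deliberately naive (it rejects values containing the quote character used to wrap them rather than trying to escape them).

```javascript
// Build an HTML datasource query, insisting on an explicit xpath so the
// query never falls back to returning the whole body.
function buildHtmlQuery(pageUrl, xpath) {
  if (!xpath) {
    throw new Error("An xpath is required; the default returns the whole body.");
  }
  // Naive safety check: the url is wrapped in double quotes and the xpath
  // in single quotes, so reject values containing their own wrapper quote.
  if (pageUrl.indexOf('"') !== -1 || xpath.indexOf("'") !== -1) {
    throw new Error("Unsupported quote character in url or xpath.");
  }
  return 'select * from html where url="' + pageUrl + '" and xpath=\'' + xpath + "'";
}

// Rebuilds Query 4. as a single-line statement.
var htmlQuery = buildHtmlQuery(
  "http://finance.yahoo.com/q?s=yhoo",
  '//div[@id="yfi_headlines"]/div[2]/ul/li/a'
);
console.log(htmlQuery);
```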