
Feb 25, 2013

And the Oscar goes to…The Feed Doctor for his work on REGEXGET

Since there was nothing on TV this weekend except the Oscars, The Feed Doctor decided to write some award-worthy business rules.

I often get requests for help extracting data from a larger text field. A common case is pulling image URLs out of a large block of HTML. REGEXGET is useful for this, since image tags follow a definite pattern. For example, let’s say the description attribute looks like this:

Stuff <img URL=’http://www.mysite.com/image1.jpg’ /> and more stuff

We know we need to look for the img tag, so we could do this:

REGEXGET ($description,”img URL=’[^’]+’”)

This pattern matches the text “img”, a space, the text “URL”, the equal sign, the opening quote, then one or more characters that aren’t a quote, then finally the closing quote. Perfect, right? Except, here’s the output:

img URL=’http://www.mysite.com/image1.jpg’
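If you want to try the pattern outside the feed tool, here is a rough Python equivalent. It is only a sketch: it assumes REGEXGET behaves like an ordinary first-match regex search, and it uses straight quotes in the sample text instead of the curly ones shown above.

```python
import re

# Sample description from the post, written with straight quotes (an assumption).
description = "Stuff <img URL='http://www.mysite.com/image1.jpg' /> and more stuff"

# Same idea as the rule: literal "img URL='", one or more non-quote
# characters, then the closing quote.
match = re.search(r"img URL='[^']+'", description)

# group(0) is the entire matched text, which still includes "img URL=".
first_match = match.group(0) if match else ""
print(first_match)
```

The printed text is the whole matched fragment, not just the URL, which is the same problem the rule has.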

It’s almost what we want, but not quite. We still need to do some more processing to remove everything that’s not the actual URL. Now, we could change the pattern to just match the URL, like this:

REGEXGET ($description,”http://[^’]+”)

But what if there’s, say, a hyperlink in the description before the image tag? And for that matter, what if there’s more than one image tag? REGEXGET only gets the first match, so if you really wanted the second image tag (assuming there were one), what would you do?
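To see the hyperlink problem concretely, here is a small Python sketch. The description text, including the sale.html link, is made up for illustration, and the sketch again assumes a plain first-match search with straight quotes.

```python
import re

# Hypothetical description: a hyperlink appears before the image tag.
description = (
    "See <a href='http://www.mysite.com/sale.html'>our sale</a> "
    "<img URL='http://www.mysite.com/image1.jpg' /> and more stuff"
)

# The URL-only pattern grabs the first thing that looks like a URL,
# which here is the hyperlink rather than the image.
match = re.search(r"http://[^']+", description)
first_url = match.group(0) if match else ""
print(first_url)
```

The first match is the link target, so a pattern that ignores the surrounding img tag cannot tell the two apart.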

The good news is that regular expressions are extremely flexible, and so this weekend I updated REGEXGET to have two new optional parameters.

The first parameter is the “match index.” This is for the case when your pattern may occur more than once, and you want something besides the first one. Let’s say the description has two image URLs:

Stuff <img URL=’http://www.mysite.com/image1.jpg’ /> and more stuff. More stuff  <img URL=’http://www.mysite.com/image2.jpg’ /> etc

If you want the first image URL, the rule we’ve got will work. But what if you want the second? Let’s update the rule:

REGEXGET ($description,”img URL=’[^’]+’”, 2, 1)

The 2 is the match index. It tells REGEXGET to get the second piece of text that matches the pattern…if there is one. But what’s that 1 for, after the 2? That’s the “group index.” If you’ve done some serious regex stuff, you’re probably familiar with capture groups and back references. Capture groups are a way to mark off parts of your pattern as “reusable” chunks. The way you do it is with parentheses, and it’s how we’re going to pull just the URL out of the image tag. Let’s update the rule one more time:

REGEXGET ($description,”img URL=’([^’]+)’”, 2, 2)

Two things are different. First, I put parentheses in the pattern; notice how they are just inside the quotes, which means they surround the part of the pattern that matches the URL. Next, I changed the last input parameter from a 1 to a 2, which asks for the second group. In REGEXGET, the first group is always the entire text that matched the pattern. The second is just the part that matched what’s in parentheses; in this case, the second image URL.
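Here is a minimal Python sketch of how the two indexes fit together. The regexget function below is a hypothetical stand-in, not the real implementation: it assumes both indexes are 1-based, that group index 1 means the whole match (Python calls that group 0), that both parameters default to 1, and that an empty string comes back when nothing qualifies.

```python
import re

def regexget(text, pattern, match_index=1, group_index=1):
    """Hypothetical emulation of the rule described in the post."""
    matches = list(re.finditer(pattern, text))
    if not 1 <= match_index <= len(matches):
        return ""  # assumed behavior when the requested match is missing
    match = matches[match_index - 1]
    # Python numbers the whole match as group 0, so shift the index by one.
    python_group = group_index - 1
    if not 0 <= python_group <= match.re.groups:
        return ""  # assumed behavior when the requested group is missing
    return match.group(python_group) or ""

# Two-image description from the post, written with straight quotes.
description = (
    "Stuff <img URL='http://www.mysite.com/image1.jpg' /> and more stuff. "
    "More stuff  <img URL='http://www.mysite.com/image2.jpg' /> etc"
)
pattern = r"img URL='([^']+)'"

whole_tag = regexget(description, pattern, 2, 1)  # second match, entire text
url_only = regexget(description, pattern, 2, 2)   # second match, parenthesized part
print(whole_tag)
print(url_only)
```

With match index 2 and group index 1 you get the whole second tag fragment; switching the group index to 2 leaves only the second image URL.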

I had a big long list of people to thank, who helped make this happen, but the music is playing now and I have to get off stage. Just one reminder: this change will be released with our next update, in March.

For more information on Business Rules, check out the SSC guide to Using Text Functions Business Rules

Blogpost by Anthony Alford, The Feed Doctor
