And the Oscar goes to…The Feed Doctor for his work on REGEXGET
Often I get requests for help extracting data from a larger text field. For example, people want to get image URLs from a large block of HTML. REGEXGET is useful for this, since the image tags follow a definite pattern. For example, let’s say the description attribute looks like this:
Stuff <img URL=’http://www.mysite.com/image1.jpg’ /> and more stuff
We know we need to look for the img tag, so we could do this:
REGEXGET ($description,”img URL=’[^’]+’”)
This pattern matches the text “img”, the space, the text “URL”, the equal sign, the quote, then one or more characters that aren’t a quote, then finally the closing quote. Perfect, right? Except, here’s the output:
img URL=’http://www.mysite.com/image1.jpg’
It’s almost what we want, but not quite. We still need to do some more processing to remove everything that’s not the actual URL. Now, we could change the pattern to just match the URL, like this:
But what if there’s, say, a hyperlink in the description before the image tag? And for that matter, what if there’s more than one image tag? REGEXGET only gets the first match, so if you really wanted the second image tag (assuming there were one), what would you do?
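To make the hyperlink problem concrete, here is a rough Python sketch. It uses Python's re module rather than REGEXGET itself, and both the sample description and the URL-only pattern are made up for illustration (with straight quotes instead of curly ones), so treat it as a demonstration of the pitfall and nothing more:

```python
import re

# Hypothetical description: a hyperlink appears before the image tag.
description = (
    "See <a href='http://www.mysite.com/page.html'>this</a> "
    "<img URL='http://www.mysite.com/image1.jpg' /> and more stuff"
)

# Assumed URL-only pattern, invented for this sketch.
url_only = r"http://[^']+"

# search() returns only the first match, like a first-match-only lookup.
first = re.search(url_only, description)
first_url = first.group(0) if first else ""
print(first_url)  # prints the hyperlink's URL, not the image's
```

The pattern happily grabs the link's URL because it comes first, which is exactly why a bare URL pattern isn't enough on its own.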
The good news is that regular expressions are extremely flexible, and so this weekend I updated REGEXGET to have two new optional parameters.
The first parameter is the “match index.” This is for the case when your pattern may occur more than once, and you want something besides the first one. Let’s say the description has two image URLs:
Stuff <img URL=’http://www.mysite.com/image1.jpg’ /> and more stuff. More stuff <img URL=’http://www.mysite.com/image2.jpg’ /> etc
If you want the first image URL, the rule we’ve got will work. But what if you want the second? Let’s update the rule:
REGEXGET ($description,”img URL=’[^’]+’”, 2, 1)
The 2 is the match index. It tells REGEXGET to get the second piece of text that matches the pattern…if there is one. But what’s that 1 for, after the 2? That’s the “group index.” If you’ve done some serious regex stuff, you’re probably familiar with capture groups and back references. Capture groups are a way to mark off parts of your pattern as “reusable” chunks. The way you do it is with parentheses, and it’s how we’re going to pull just the URL out of the image tag. Let’s update the rule one more time:
REGEXGET ($description,”img URL=’([^’]+)’”, 2, 2)
Two things are different. First, I put parentheses in the pattern; notice how they are just inside the quotes, which means they surround the part of the pattern that matches the URL. Next, I changed that last input parameter from a 1 to a 2. This tells REGEXGET to return the second capture group. Why the second? Because the first group is always the entire text that matched the pattern. The second is just the part that matched what’s in parentheses; in this case, the second image URL.
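If you'd like to play with the two indexes outside of a business rule, here is a minimal Python sketch that mimics the behavior described above. The regexget function is a hypothetical stand-in built on Python's re module, not the real REGEXGET code. It assumes both indexes start at 1, that group 1 is the whole match (as described in this post), and that a missing match or group gives back an empty string; that last part is my guess, not documented behavior. The sample text uses straight quotes.

```python
import re

def regexget(text, pattern, match_index=1, group_index=1):
    """Hypothetical stand-in for REGEXGET, for illustration only."""
    matches = list(re.finditer(pattern, text))
    # Assumption: an out-of-range match index yields an empty string.
    if match_index < 1 or match_index > len(matches):
        return ""
    m = matches[match_index - 1]
    # Group index 1 is the whole match (Python's group 0),
    # group index 2 is the first parenthesized group, and so on.
    if group_index < 1 or group_index - 1 > m.re.groups:
        return ""
    return m.group(group_index - 1) or ""

description = (
    "Stuff <img URL='http://www.mysite.com/image1.jpg' /> and more stuff. "
    "More stuff <img URL='http://www.mysite.com/image2.jpg' /> etc"
)

# Second match, whole text of the match.
second_tag = regexget(description, "img URL='[^']+'", 2, 1)
# Second match, just the part inside the parentheses.
second_url = regexget(description, "img URL='([^']+)'", 2, 2)

print(second_tag)
print(second_url)
```

Here second_tag comes back as the full "img URL=..." text for the second image, while second_url is only http://www.mysite.com/image2.jpg, which is the piece we were after all along.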
I had a big long list of people to thank, who helped make this happen, but the music is playing now and I have to get off stage. Just one reminder: this change will be released with our next update, in March.
For more information on Business Rules, check out the SSC guide to Using Text Functions Business Rules.
Blogpost by Anthony Alford, The Feed Doctor