Another title might be "How to cross reference links in blog content with an opml file", but I'm sure you get the idea.
The ultimate goal was to subscribe only to the people that I have linked to in the past. Since I subscribed to several aggregate feeds (at least 3 at microsoft and a couple more like dotnetjunkies.com and geekswithblogs.com), it wasn't going to be easy to track down where I found the people I have linked to. In the end, I ended up with a list of more than 6.000 links I had made, which matched around 600 different subscribers in my opml file. I was left with a not found list of around 900 links afterwards, of which are mostly the people I found via aggregate blogs, have changed their blog address over the years or have a url to their blog that I ididn't handle in my scrubbing of urls to blog urls.
Step 1: Get a local copy of the database (or at least the blog_content table)
This step was pretty boring, not much involved. I just imported the tables and their data from the web site's database to a local Sql 2005 database.
Step 2: Create RegEx function on db server to pull out the link text from any entries that have "Interesting" in the title
The goal was to extract specific text from a bunch of text - a prefect scenario to use regular expressions in a new sql server 2005 function. Two places gave me the starting point I needed:
Vineet Rao's blog entry on Working with Regular Expressions [in Sql Server 2005 functions]. I used Vineet's function (shown below) as a starting point:
[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static bool RegExCompiledMatch(string pattern, string matchString)
{
return Regex.Match(matchString.TrimEnd(null), pattern.TrimEnd(null), RegexOptions.Compiled).Success;
}
His function does a good job at allowing you to test if a field in the database contains a specific regular expression, but I needed to change it a bit to return the actual text that matched and to return any multiple results found in the field. Which leads me to the second place that gave me an important piece to use: Scott Mitchell's article on Utilizing Regular Expression SubMatches. The most important piece is the regular expression that will return a hyperlink's text:
\s*((\n|.)+?)\s*
After playing around with Vineet's implementation and Scott's regex pattern I ended up with these sql functions to get the link url (by utilizing the matching groups):
[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static string RegExCompiledMatchText(string pattern, string matchString, int returnGroupIndex)
{
return RegExCompiledMatchTextDelimited(pattern, matchString, returnGroupIndex, Environment.NewLine);
}
[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static string RegExCompiledMatchTextDelimited(string pattern, string matchString, int returnGroupIndex, string delimiter)
{
Match m = Regex.Match(matchString.TrimEnd(null), pattern.TrimEnd(null), RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Multiline);
string temp = string.Empty;
while (m.Success)
{
GroupCollection cg = m.Groups;
if (cg.Count > 1)
{
foreach (Capture c in cg[returnGroupIndex].Captures)
{
temp += c.Value + delimiter;
}
}
m = m.NextMatch();
}
By using a sql statement like the one shown below, I could get a text file from Sql Management Studio showing the matching links on their own lines (even though some are actually in the same entry or database row):
SELECT dbo.RegExCompiledMatchText('(>\s*((\n|.)+?)\s*)"', Text, 1) AS Link
FROM blog_Content
WHERE Title LIKE '%Interesting%'
Step 3: Parse and clean up the link urls to a more likely blog url that can be matched against an opml file
After looking at the list of links returned from the Sql function, I soon realized some scrubbing was going to be needed in order to grab the actual blog url from the link url. Origninally I thought just the domain would work for the majority of the matching - but that was quickly ruled out due to the number of blogs that share the same domain. I guess I forgot, not everyone wants to rule their own domain :)
In order to scrub the links, I determined I would write a custom sql function to due some checking that would cover most of the blog url formats. The function(s) I ended up with (after a couple of iterations) is shown below, it is specifically gear towards extracting links from the given text and returns a delimited listing of some what cleaned up links that resemble the beginning of the blog url the link was to.
[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static string ParseBlogUrl(string matchString)
{
string pattern = @"""(http://(.*?))""";
int returnGroupIndex = 1;
Match m = Regex.Match(matchString.TrimEnd(null), pattern.TrimEnd(null), RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Multiline);
string temp = string.Empty;
while (m.Success)
{
GroupCollection cg = m.Groups;
if (cg.Count > 1)
{
foreach (Capture c in cg[returnGroupIndex].Captures)
{
temp += ParseBlogUrlFromLink(c.Value) + System.Environment.NewLine;
}
}
m = m.NextMatch();
}
return temp;
}
public static string ParseBlogUrlFromLink(string url)
{
string[] parts = url.Split('/');
string blogUrl = string.Empty;
// ignore part 0 'http'
// ignore part 1 ''
blogUrl = parts[2]; // domain
// if it's a feedburner reference, return an empty string
if (Contains(blogUrl, "feedburner"))
{
if (parts[3] == "~r")
{
return string.Empty;
}
else
{
// trim the m= off end and return
int m = parts[3].IndexOf('?');
if (m > -1)
{
blogUrl += "/" + parts[3].Substring(0, m);
return blogUrl;
}
return blogUrl += "/" + parts[3];
}
}
for (int i = 3; i < parts.Length; i++)
{
// if the part contains any of these, then we have enough to match on
if (Contains(parts[i], "archive") || Contains(parts[i], "permalink") || Contains(parts[i],"article")
|| Contains(parts[i], ".asp") || Contains(parts[i], ".htm") || Contains(parts[i], ".jsp") || Contains(parts[i], ".php") || Contains(parts[i], ".ash")
|| Contains(parts[i], ".ppt") || Contains(parts[i], ".pdf") || Contains(parts[i], ".jpg")
|| parts[i] == "2004" || parts[i] == "2005" || parts[i] == "2006" || parts[i] == "2007")
{
return blogUrl;
}
blogUrl += "/" + parts[i];
if (parts[i].ToLower() == "blog" || parts[i].ToLower() == "linkblog" || parts[i].ToLower() == "log")
{
return blogUrl;
}
}
return blogUrl;
}
public static bool Contains(string val, string searchFor)
{
return val.ToLower().IndexOf(searchFor.ToLower()) > -1;
}
What this function does is, it finds a matching link href and then converts something like jasonhaley.com/blog/10/02/2005/123.aspx to something like jasonhaley.com/blog. The resulting string would then match the begining of the rss url in an opml file like http://jasonhaley.com/blog/rss.aspx.
Step 4: Quick and Dirty step ... export Sql Management Studio's return to Excel and import back into Sql Server
This step was just a quick way to get the output into its own table. I know there are other ways, but I did it this way since I was already playing around in Sql Management studio and Excel (some interesting pivot tables).
Continued to next post ...