OK, so a number of you may have noticed some major gumming up of PlanetMySQL over the last week or so. Some feeds were not appearing at all, and yet other feeds were inserting entries that had nothing to do with MySQL — entries that previously never made it into the PlanetMySQL database. Here’s the story on what happened, how I screwed everything up, and how I figured out what was going on…
The Backstory
All of the problems stemmed from a well-intentioned effort on my part to figure out why certain feeds weren’t being properly aggregated by the PlanetMySQL feed reader. Arjen wrote the PlanetMySQL feed reader originally, back in 2005, and it has performed admirably up until this point. The feed reader uses the Magpie RSS library for its grunt work, and has a filtering system that allows the Planet administrators to assign a regular expression filter to a specific feed so that entries not related to MySQL don’t get inserted into the Planet database.
The database schema is fairly straightforward. There are three main tables that relate to each other: feeds, entries, and filters. Each feed may be marked with a specific filter to apply to that feed’s entries, which Magpie fetches. For instance, Brian’s blog feed: http://www.livejournal.com/users/krow/data/rss, has the following filter applied to it:
=(database|dbms|mysql|myisam|innodb|innobase|connector/|maxdb|eventum|software ?patent|swpat|ffii)=i. In addition, each filter can be either a negative filter — meaning that entries which match the expression should be excluded — or a positive filter — meaning entries must have a match for the regular expression in either the tags (categories), title, or content of the entry.
Problem #1
OK, so I went to work trying to figure out why certain feed entries were being rejected by the filtering system that shouldn’t have been. In the walkabout.php script, there was little in the way of logging output; just a few printouts when errors occurred. But, we wanted a bit more information about the content (and filtering) of entries, and so we added the following to the script:
if ($feed_filter_negate) { /* * Here we have a filter where if the content, tags, or title * DO match something in the regex, we filter out the entry */ if ($tags_triggers_filter || $title_triggers_filter || $content_triggers_filter) { print " ! Entry has been rejected during filtering with following triggers:\n"; continue; } else { print " + Entry passed filter\n"; } } else { /* * Here we have a filter which MUST match to something in the tags, * title or content, otherwise entry is discarded */ if ($tags_triggers_filter && $title_triggers_filter && $content_triggers_filter) { print " ! Entry has been rejected during filtering with following triggers:\n"; print " ! No positive regex match in tags, title OR content\n"; continue; } else { print " + Entry passed filter\n"; } } }
Anyone spot the problem with the above code? Well, it took me a while to figure out why entries that didn’t meet the regex match were showing up in the database… Turns out that the preg_match() function doesn’t follow the same return value output as other PHP function calls, notably strpos(), which return the boolean value of FALSE if no match of the pattern is found. Why is this important? Well, because the number zero (0) could be the location within a string that the searched for string is located, and therefore it is essential for the programmer to check for the literal FALSE and not the integer zero.
Well, I assumed (yeah, I know…) that preg_match() operated the same way as strpos(). In fact, it does not; instead preg_match() returns FALSE only when an error occurs in the regular expression matching. Otherwise, the function returns the integer zero to represent that no matches were found. So… the code was corrected to the following:
if ($feed_filter_negate) { /* * Here we have a filter where if the content, tags, or title * DO match something in the regex, we filter out the entry */ if ($tags_triggers_filter || $title_triggers_filter || $content_triggers_filter) { print " ! Entry has been rejected during filtering with following triggers:\n"; continue; } else { print " + Entry passed filter\n"; } } else { /* * Here we have a filter which MUST match to something in the tags, * title or content, otherwise entry is discarded */ if ($tags_triggers_filter && $title_triggers_filter && $content_triggers_filter) { print " ! Entry has been rejected during filtering with following triggers:\n"; print " ! No positive regex match in tags, title OR content\n"; continue; } else { print " + Entry passed filter\n"; } } }
and, voila, the feeds were being properly filtered again, now with logging output so I can tell why an entry makes it in to the database or not.
Problem #2
Unfortunately, due to the foul-up described above, there were now loads of blog entries in the PlanetMySQL database which shouldn’t be, and a cleanup was needed. So, I log into the planetmysql database and get to work figuring out how I can clean up the mess I’ve wrought…
My first stab at finding entries which don’t match the required filter patterns looked like this:
SELECT COUNT(*) FROM entries e JOIN feeds f ON e.feed_id=f.feed_id JOIN filters i ON f.filter_id=i.filter_id WHERE e.content REGEXP i.pattern = 0 AND e.categories REGEXP i.pattern = 0 AND e.title REGEXP i.pattern = 0 AND i.filter_id IN (1,2);
The problem with the above was soon figured out, as a count for all the entries was returned, which couldn’t possibly be correct. So, after examining the format of the REGEXP MySQL function, I figured out that I needed to strip the (=) and (=i) from the stored regular expression filter (which is used because of the preg_match() syntax. The new SQL was:
SELECT COUNT(*) FROM entries e JOIN feeds f ON e.feed_id=f.feed_id JOIN filters i ON f.filter_id=i.filter_id WHERE e.content REGEXP REPLACE(REPLACE(i.pattern,"=i",""),"=","") = 0 AND e.categories REGEXP REPLACE(REPLACE(i.pattern,"=i",""),"=","") = 0 AND e.title REGEXP REPLACE(REPLACE(i.pattern,"=i",""),"=","") = 0 AND i.filter_id IN (1,2);
539 rows were returned, which made much more sense. A quick check of the data revealed that yes, indeed , the records found by this expression had nothing to do with MySQL, and so I removed the records after copying them to a table in another database, just in case.
So, all in all, it was a good learning experience, but I apologize for the inconvenience this caused. Hopefully, this recounting can help some other poor soul in a similar situation.