Syntax Highlighting and Allowing HTML in Comments

As I said in my last post, the new MySQL Forge commenting system is pretty slick. It gives commenters a lot of freedom in how they display their comments, including syntax-highlighted code sections, while still guarding against XSS attacks and the like. The HTMLPurifier and GeSHi PHP libraries are used in tandem to give flexibility and security at the same time.

The code to enable this is fairly short. For you PHP devs out there, here is the code that does everything for cleaning and “codifying” the comments:

    /**
     * Highlights the text as code in the supplied language
     *
     * @param string $subject The text to mark up
     * @param string $language The language to use for highlighting
     * @return string The marked-up code
     */
    public static function syntax_highlight($subject, $language) {
        /* Format the code with GeSHi */
        include_once(APP_DIR . '/opt/geshi/geshi.php');
        $geshi= new GeSHi($subject, $language);
        $geshi->enable_classes();
        $geshi->enable_line_numbers(GESHI_NORMAL_LINE_NUMBERS);
        return $geshi->parse_code();
    }

    /**
     * Returns a cleaned and syntax-highlighted string of HTML
     *
     * @param string $subject The text to cut into code pieces
     * @return string Cleaned and codified text
     */
    public static function clean_and_codify($subject) {
        $original= $subject;
        $code_pieces= array();
        /* Matches [code lang="x"]...[/code] and <code language='x'>...</code> */
        $code_regex= '/[\[\<]code\s*(lang|language)\=[\"\'](\w+)[\"\'][\]\>]([\s\S]+?)[\[\<]\/code[\]\>]/';
        $code_delimiter= "CODECODECODE";

        /* First split the text into code and non-code blocks */
        while (preg_match($code_regex, $subject, $code_matches) == 1) {
            $language= trim(strtolower($code_matches[2])); // 0-index is the full match
            $code_sample= $code_matches[3];
            $code_sample= str_replace("\t", " ", $code_sample); /* Replace tabs with spaces */
            $code_pieces[]= array('lang'=>$language
                                , 'text'=>$code_sample);
            /* Replace only this match, so identical code blocks each keep their own entry */
            $subject= preg_replace($code_regex, $code_delimiter, $subject, 1);
            $code_matches= array(); //reset
        }

        /*
         * Assume two consecutive newlines are a paragraph.
         */
        /* Normalize Newlines */
        $subject = str_replace("\r\n", "\n", $subject);
        $subject = str_replace("\r", "\n", $subject);
        $subject = preg_replace("/[\n]{2}/", "<p>", $subject);

        /*
         * Next, do the same thing with markup sections
         * We use HTMLPurifier here for safe checks with some allowed
         * tags for ease of use
         */
        include_once(APP_DIR . '/opt/htmlpurifier/library/');
        $config = HTMLPurifier_Config::createDefault();
        $config->set('HTML', 'Doctype', 'XHTML 1.0 Transitional');
        $config->set('HTML', 'AllowedElements', 'a,em,blockquote,p,code,pre,strong,b');
        $config->set('HTML', 'AllowedAttributes', 'a.href,a.title');
        /*
         * Light tidying should be enough since we don't allow many elements;
         * this really just cleans up dangling elements
         */
        $config->set('HTML', 'TidyLevel', 'light');
        $purifier= new HTMLPurifier();
        $subject= $purifier->purify($subject, $config);

        /*
         * Now $subject should contain CleanMarkup\nCODECODECODE\nCleanMarkup...
         * We now replace the code sections by passing an executable string
         * to the regex parser (the /e option) and using the syntax_highlight
         * function to do the grunt work
         */
        $num_code_pieces= count($code_pieces);
        $i= 0;
        if ($num_code_pieces > 0) {
            $replacement= "TextDecorator::syntax_highlight(trim(\$code_pieces[\$i]['text'], \"\r\n \"), \$code_pieces[\$i++]['lang']);";
            $subject= preg_replace('/' . $code_delimiter . '/e', $replacement, $subject);
        }
        return $subject;
    }

The code above comes from a TextDecorator class in the Forge code. The GeSHi and HTMLPurifier libraries do most of the grunt work. The trick in the above code is two-fold. First, I pre-process the code section blocks: each block is stored in an array and replaced in the comment text with a delimiter. I do this so that the code section blocks never go through HTMLPurifier, which would scramble them entirely. Then, with the code blocks safely set aside, I run the rest of the comment text through HTMLPurifier, allowing a few benign HTML tags so that comments can be “pretty” and quotations can be cited.
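To make that first step concrete, here is a small standalone sketch of just the extraction pass. It uses a regex equivalent to the one in the listing and the same delimiter, and it replaces one match per pass. The function name extract_code_blocks() and the sample comment are made up for illustration and are not part of the Forge code.

```php
<?php
/**
 * Pulls [code lang="..."]...[/code] blocks out of a comment and leaves a
 * placeholder behind for each one. Hypothetical helper, for illustration only.
 *
 * @param string $subject   Raw comment text
 * @param string $delimiter Placeholder that stands in for each code block
 * @return array            array(text with placeholders, list of code pieces)
 */
function extract_code_blocks($subject, $delimiter = 'CODECODECODE') {
    $code_regex = '/[\[<]code\s*(?:lang|language)=["\'](\w+)["\'][\]>]([\s\S]+?)[\[<]\/code[\]>]/';
    $code_pieces = array();
    while (preg_match($code_regex, $subject, $m) === 1) {
        $code_pieces[] = array(
            'lang' => strtolower($m[1]),
            'text' => str_replace("\t", " ", $m[2]),
        );
        // Replace only the match just found, so pieces and placeholders stay in step
        $subject = preg_replace($code_regex, $delimiter, $subject, 1);
    }
    return array($subject, $code_pieces);
}

$comment = "Try this:\n[code lang=\"sql\"]SELECT 1;[/code]\nThanks!";
list($text, $pieces) = extract_code_blocks($comment);
```

After this runs, $text holds the comment with the code block swapped for the delimiter, and $pieces holds one entry with the language and the untouched code. Only $text would then be handed to HTMLPurifier.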

Finally, I use the preg_replace function with the /e modifier. The /e modifier allows me to run PHP code against matched elements. I am matching against the delimiter that replaced the code sections in the first part of the TextDecorator::clean_and_codify() method. The code that is executed for each match is TextDecorator::syntax_highlight(trim($code_pieces[$i]['text'], "\r\n "), $code_pieces[$i++]['lang']) (the backslashes in the listing are only there to escape the PHP string). The trick here is that I am highlighting the code stored in the $code_pieces array and incrementing the $i variable at the same time, meaning each successive execution of the code will highlight the next element in the $code_pieces array.
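The /e modifier was deprecated in PHP 5.5 and removed in PHP 7, so on newer PHP versions the same walk through $code_pieces has to be written with preg_replace_callback() instead. Below is a minimal sketch of that approach, not the Forge code itself: fake_highlight() is a made-up stand-in so the example runs without GeSHi, and restore_code_blocks() is a hypothetical name.

```php
<?php
/* Stand-in for TextDecorator::syntax_highlight() so the sketch runs without GeSHi */
function fake_highlight($code, $language) {
    return '<pre class="' . htmlspecialchars($language) . '">'
         . htmlspecialchars($code) . '</pre>';
}

/**
 * Swaps each placeholder for the next highlighted code piece, in order.
 *
 * @param string $subject     Purified text containing placeholders
 * @param array  $code_pieces List of array('lang' => ..., 'text' => ...)
 * @param string $delimiter   Placeholder used during extraction
 * @return string
 */
function restore_code_blocks($subject, array $code_pieces, $delimiter = 'CODECODECODE') {
    $i = 0;
    return preg_replace_callback(
        '/' . preg_quote($delimiter, '/') . '/',
        function ($match) use ($code_pieces, &$i) {
            if (!isset($code_pieces[$i])) {
                return $match[0]; // more placeholders than pieces: leave as-is
            }
            $piece = $code_pieces[$i++]; // same "use, then advance" trick as with /e
            return fake_highlight(trim($piece['text'], "\r\n "), $piece['lang']);
        },
        $subject
    );
}

$pieces = array(
    array('lang' => 'sql', 'text' => "\nSELECT 1;\n"),
    array('lang' => 'php', 'text' => 'echo "hi";'),
);
$html = restore_code_blocks("A CODECODECODE B CODECODECODE", $pieces);
```

The counter is captured by reference, so each call of the callback picks up where the previous one left off. The isset() guard is an extra precaution in this sketch for the case where a commenter types the delimiter text themselves.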

It should be no surprise that the above code is relatively expensive to execute, given the multiple libraries involved and the repeated regex matching. Therefore, when a comment is posted, I store both the original text and the cleaned/codified text in the MySQL database table. When the comment is displayed, I simply output the pre-processed text.
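A rough sketch of that store-both-versions idea follows. The table and column names (comments, body_raw, body_html) and both helper functions are assumptions for illustration, not the actual Forge schema or code. The decorator is passed in as a callable so the example does not depend on the TextDecorator class.

```php
<?php
/**
 * Builds the values to store for one comment: the text as typed, plus the
 * cleaned/highlighted HTML computed once at write time.
 *
 * @param string   $raw      Comment text exactly as the user typed it
 * @param callable $decorate e.g. array('TextDecorator', 'clean_and_codify')
 * @return array             Named parameters for the INSERT below
 */
function build_comment_row($raw, $decorate) {
    return array(
        ':raw'  => $raw,
        ':html' => call_user_func($decorate, $raw),
    );
}

/**
 * Saves both versions with a prepared statement (hypothetical schema).
 *
 * @param PDO      $db       Open database connection
 * @param string   $raw      Comment text exactly as the user typed it
 * @param callable $decorate Function that produces the display HTML
 */
function save_comment(PDO $db, $raw, $decorate) {
    $stmt = $db->prepare(
        'INSERT INTO comments (body_raw, body_html) VALUES (:raw, :html)'
    );
    $stmt->execute(build_comment_row($raw, $decorate));
}
```

With something like this in place, displaying a comment only needs to select body_html and echo it; the expensive cleaning and highlighting never runs on a page view.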

I’d like to thank Edward Yang and Nigel McNie, the authors of HTMLPurifier and GeSHi, respectively, for their amazing libraries. Hope this helps others looking for a clean solution to this problem!