Archive for category PHP

PHP/MySQL Summer of Code Projects for MySQL Forge

As Giuseppe and Kaj have blogged about, the student proposal period for the Google Summer of Code is underway from today through March 31st.

The new MySQL Forge has a wiki page with suggested MySQL-related GSoC projects that a number of MySQL engineers and community team members have proposed. The projects include C/C++ projects in the MySQL server as well as other projects for the non C gurus among you, including some projects in Lua for the MySQL Proxy.

PHP Hacker? Know MySQL? Hack on the MySQL Forge for Google Summer of Code

If you love PHP and MySQL, there are other ways you can participate in the summer of code as a MySQL project submitter. I will be open-sourcing the new MySQl Forge code base shortly (only technical issues remain…). I would love to see some interest from the PHP and MySQL community in enhancing the new MySQL Forge with new functionality. Therefore, I have created a section on the Forge wiki with a list of proposed projects for GSoC students. Here is the list so far, but feel free to propose anything else you’d like to see implemented on the Forge:

  • Develop a “karma system” for Forge users that is integrated with the MySQL forums, mailing list activity, Forge wiki activity, and code snippet creation
  • Enhance the user interface so that Forge users can “watch” a project, code snippet, worklog task, etc. Preferably a nice AJAX widget similar to what the rating button currently is
  • Develop an RSS and Atom feed system for the Forge “new items” (tasks, projects, code snippets, people?)
  • Integrate PlanetMySQL with the MySQL Forge People section
    • The user “blogs on PlanetMySQL”, for instance…
  • Integrate OhLoh.net statistics into the Projects area
  • Expand test-suite for MySQL Forge and better integration with PHPUnit3

Also feel free to email me at my first name at mysql dot com if you wish to chat or brainstorm ideas on PHP and MySQL GSoC projects.

For those unaware of what the Google Summer of Code is, here is an FAQ

PHP-Quebec – Great Stuff … Oh, and the Slides

So, this past week I was in Montreal attending the annual PHP-Quebec conference. It was my first time up in Montreal and I have to say, it’s a fun place! It was great meeting so many interesting folks and my session was jam-packed and a lot of fun. :)

One of the most interesting conversations at the conference was regarding PDOv2 and what, if anything, is going on with it. Lukas Smith raised the spectre of PDOv2 (after a tiny little prod from me. ;) ) at the Database Panel discussion with me, Kitman Cheung from IBM, Kuassi Mensah from Oracle, and Bob Bernier from PostgreSQL. It seems that Microsoft and Oracle have responded to the PDO mailing list and we are awaiting some word from IBM. Kitman promised to look into it and have someone from IBM respond shortly. In general, besides the active community members in attendance, not too many folks were that gung-ho about PDOv2, and it seemed to me that the suggestion that the community should simply enhance PDOv1 with better docs, test coverage and standardization of interfaces was the best option…anyway, more on that later.

Here are the links to the slides which are identical to the slides I gave at CodeMash this past January, but I will post them anyway. Cheers, all!

For Those Who Say I Never Take Photos…

I often get accused, especially by my European friends, of not being a true geek for two reasons: I don’t carry an expensive camera with me at all times — nice Fisheye lens, Kaj! — and because I don’t indulge in video games. Well, for all you Jay-ain’t-a-true-geekers out there, I offer two photos today to prove my geekhood.


As some of you know, I live in Columbus, Ohio, with 2 dogs, 2 cats, and my lovely wife Julie. Last night, we got hammered with the most snow we’ve seen in March in over 40 years. By my estimation, there is about 18 inches of snow on the ground. The photo to the right was taken this morning, while the snow was still coming down, from my back door. You can see my garage and the ground covered in snow, but the most telling part of the photo is the amount of snow on the table on our back deck. I think there’s at least 18 inches of snow on it. At the very bottom of the photo, you see the height of the snow on the ground; that strip on the very bottom is my back door threshold…

So, what does this mean for me? A snow day? Heck no! I’m a MySQL employee and I work from home. I don’t take snow days, of course, so while my wife has closed her yoga studio, I’m snug and warm and working from home on a Saturday on my slides for next week’s presentation at PHP-Quebec on “Performance-minded MySQL for PHP Developers“.

If you’re in the Montreal area and want to go, I have two free passes to give away since Sun/MySQL is a sponsor for the event; we’re sponsoring a cocktail party and job fair on Wednesday evening. So, if you’re interested email me at my name at mysql dot com.


Which kind of leads me to my next photo… This one is of my dining room table covered with paper. What are all these pieces of paper? Well, each and every paper on the table is a separate form or agreement I needed to sign and fill out as part of becoming a Sun employee. Hmmm. I don’t think I remember having to sign so much paperwork when I joined MySQL over two years ago. Oh, how I yearn for the simple days of a small company! :)

Hope that the photo to the left doesn’t scare anyone away from working at Sun! You know, we’re hiring…

Syntax Highlighting and Allowing HTML in Comments

Like I said my last post, the new MySQL Forge commenting system is pretty slick. It gives commenters a lot of freedom in how they wish to display their comments, including syntax-highlighted code sections, while at the same time being security-conscious about XSS attacks and such. The HTMLPurifier and GeSHi PHP libraries are used in tandem to give flexibility and security at the same time.

The code to enable this is fairly short. For you PHP devs out there, here is the code that does everything for cleaning and “codifying” the comments:

  1. /**
  2.   * Highlights the text as code in the supplied language
  3.   *
  4.   * @return string The marked up code
  5.   * @param subject The text to markup
  6.   * @param language The language to use for highlighting
  7.   */
  8. public static function syntax_highlight($subject, $language) {
  9. /* Format the code with GeSHi */
  10. include_once(APP_DIR . '/opt/geshi/geshi.php');
  11. $geshi= new GeSHi($subject, $language);
  12. $geshi->enable_classes();
  13. $geshi->enable_line_numbers(GESHI_NORMAL_LINE_NUMBERS);
  14. return $geshi->parse_code();
  15. }
  16.  
  17. /**
  18.   * Returns a cleaned and syntax-highlighted string of HTML
  19.   *
  20.   * @return string Cleaned and codified text
  21.   * @param subject The text to cut into code pieces
  22.   */
  23. public static function clean_and_codify($subject) {
  24. $original= $subject;
  25. $code_pieces= array();
  26. $code_regex= '/[\[\<]code\s*(lang|language)\=[\"\'](\w+)[\"\'][\]\>]([\D\S]+?)[\[\<]\/code[\]\>]/';
  27. $code_delimiter= "CODECODECODE";
  28.  
  29. /* First split the text into code and non-code blocks */
  30. while (preg_match($code_regex, $subject, $code_matches) == 1) {
  31. $language= trim(strtolower($code_matches[2])); // 0-index is the full match
  32. $code_sample= $code_matches[3];
  33. $entire_code_string= $code_matches[0];
  34. $code_sample= str_replace("\t", " ", $code_sample); /* Replace tabs with spaces */
  35. $code_pieces[]= array('lang'=>$language
  36. , 'text'=>$code_sample);
  37. $subject= str_replace($entire_code_string, $code_delimiter, $subject);
  38. $code_matches= array(); //reset
  39. }
  40.  
  41. /*
  42.   * Assume two consecutive newlines are a paragraph.
  43.   */
  44. /* Normalize Newlines */
  45. $subject = str_replace("\r\n", "\n", $subject);
  46. $subject = str_replace("\r", "\n", $subject);
  47. $subject = preg_replace("/[\n]{2}/", "<p>", $subject);
  48.  
  49. /*
  50.   * Next, do the same thing with markup sections
  51.   * We use HTMLPurifier here for safe checks with some allowed
  52.   * tags for ease of use
  53.   */
  54. include_once(APP_DIR . '/opt/htmlpurifier/library/HTMLPurifier.auto.php');
  55. $config = HTMLPurifier_Config::createDefault();
  56. $config->set('HTML', 'Doctype', 'XHTML 1.0 Transitional');
  57. $config->set('HTML', 'AllowedElements', 'a,em,blockquote,p,code,pre,strong,b');
  58. $config->set('HTML', 'AllowedAttributes', 'a.href,a.title');
  59. $config->set('HTML', 'TidyLevel', 'light'); // should be enough since we don't allow many elements. this really just cleans up dangling elements...
  60. $purifier= new HTMLPurifier();
  61. $subject= $purifier->purify($subject, $config);
  62.  
  63. /*
  64.   * Now $subject should contain CleanMarkup\n|||CODE|||\nCleanMarkup...
  65.   * We now replace the code sections by passing an executable string
  66.   * to the regex parser (the /e option) and using the syntax_highlight
  67.   * function to do the grunt work
  68.   */
  69. $num_code_pieces= count($code_pieces);
  70. $i= 0;
  71. if ($num_code_pieces > 0) {
  72. $replacement= "TextDecorator::syntax_highlight(trim(\$code_pieces[\$i]['text'], \"\r\n \"), \$code_pieces[\$i++]['lang']);";
  73. $subject= preg_replace('/' . $code_delimiter . '/e', $replacement, $subject);
  74. }
  75. return $subject;
  76. }

The code above comes from a TextDecorator class in the Forge code. The GeSHi ad HTMLPurifier libraries do most of the grunt work. The trick in the above code is two-fold. First, I’m pre-processing the code section blocks and storing the blocks in an array and replacing the blocks with a delimiter. I do this so that I don’t run the code section blocks through HTMLPurifier, which would scramble it entirely. Then, after replacing and storing the code section blocks, I run the rest of the comment text through HTMLPurifier, allowing a few benign HTML tags so that comments can be “pretty” and that quotations can be cited.

Finally, I use the preg_replace function with the /e modifier. The /e modifier allows me to run PHP code against matched elements. I am matching against the delimiter that replaced code sections in the first part of the TextDecorator::clean_and_codify() section. The code that is executed for each match is TextDecorator::syntax_highlight(trim(\$code_pieces[\$i]['text'], \"\r\n \"), \$code_pieces[\$i++]['lang']). The trick in this is that I am highlighting the code stored in the $code_pieces array and incrementing the $i variable at the same time, meaning each successive execution of the code will highlight the next element in the $code_pieces array…

It should be no surprise that the above code is relatively expensive to execute due to the multiple libraries involved and the multiple regex matching. Therefore, what I do when a comment is posted is store both the original text and the cleaned/codified text in the MySQL database table. On presentation of the comment text, I simply display the pre-processed text.

I’d like to thank Edward Yang and Nigel McNie, the authors of HTMLPurifier and GeSHi, respectively, for their amazing libraries. Hope this helps others looking for a clean solution to this problem!

A New MySQL Forge is Born


As anyone close to me can attest, my life over the past six months or so has revolved consistently around planning for the 2008 MySQL Conference and Expo. However, in my abundant free time not reviewing proposals, working with sponsors, and organizing the conference events, I’ve also been working on a new version of the MySQL Forge platform. In fact, the new code base is almost a complete rewrite from the original one. I’m pretty pleased with the new platform, and Giuseppe and I have been doing a bunch of testing to get things to where they are now. And we’d like your help in the final testing of Forge 2.0 before it hits the mean streets of the Internet.

So, what exactly is new about Forge 2.0? Well, lots of things actually. The old forge code base suffered from a number of problems which the new code base attempts to solve. We’ve made the new platform a whole lot more interactive and, hopefully, more attractive. Users have the ability to vote, comment, and tag virtually everything on the Forge. There is a new unified commenting system that allows you to “prettify” your comments and highlight code sections in 86 different languages. And a whole lot more.

Commenting, Revisited

One of the things I really wanted was the ability for people to comment on just about anything on the Forge, whether it was a worklog task, a project, a code snippet, whatever. And, I wanted people to not be restricted to how they presented their comments. I wanted the commenter to be able to place code snippets into the comment and get that code highlighted in whatever programming language they wanted, and have the ability to put links, bold and italicize text, cite quotations, too. But, with such ability comes the risk of security holes, XSS attack vectors, and other nasties.


Well, I’m happy to say that the solution I came up with works pretty well and enables people to do most of what they want to do in comments while still being safe for web-server consumption. I used Edward Yang’s phenomenal HTMLPurifier library to enable clean markup of the comments and the fantastic GeSHi library for programming language syntax highlighting. Together, the two libraries, and a little bit of custom code on my end, work as a great combo to allow users freedom but protect the site and users at the same time.

For those interested in the code that actually does all this, I have written a post about it.

Rock the Vote

There is now the ability to vote on things you like or find important. You can rate projects and code snippets and vote for the worklog tasks that you feel are most important. I believe it is the most important new feature of the Forge. The reason is that, particularly for feature requests that are logged in our public worklog system, there isn’t currently a good way of seeing which features the community feels are most important. Feedback from the community is really important to us, and this is a good way of getting that high-level feedback about upcoming features and important bugs.

A New “People” Section

We also wanted the new Forge to be a true community-oriented place for MySQL ecosystem. And so, a new People section now gives Forge users the ability to view and search other MySQL community members. There are lots of plans for the People section, including a kind of integrated “karma” system. Currently, you can view what the Forge user has contributed to the Forge — projects they registered, code snippets they contributed, and forge wiki pages they’ve edited. Simply click on a Forge user’s name anywhere in the Forge and you’ll jump to their public profile.

Unified Wiki Login and Looks, Finally

The MySQL Forge uses MediaWiki for it’s underlying Wiki. To say that bending MediaWiki to your will is difficult would be the understatement of the year. To be honest, it’s been a pain in my *$% over the past six months trying to get it to both conform to the overall style of the main Forge and to plug in the regular MySQL Forge authentication system in place of the default MediaWiki authentication/login system.

But, thanks to hours of digging around the Internet and frustratingly going through the MediaWiki code base, I managed to modify the Forge wiki so that it didn’t look like a pile of dung (like the old wiki does). The navigation at the top of the main Forge now is available on the wiki, too (imagine that!) and the style of the wiki itself meshes nicely with the main Forge styling, too. Getting this done required making a new Skin class for MediaWiki, based on the old CologneBlue skin, and modding it up substantially.

In addition to the style problems, one of the biggest gripes people have with the old Forge is the fact that there are two logins — one for the main Forge and another for MediaWiki. This, to say the least, is extremely annoying. In fact, it would have never been the case if, up until recently, we could actually figure out how to override MediaWiki’s authentication system. We tried a few times but never could get it to work. Well, that has changed. Now, the authentication system of MediaWiki is taken over by the main Forge authentication system.

Previously, the authentication system hooked into the main mysql.com web authentication system, but for security reasons, it was decided to move away from this and have Forge users authenticate to its own system (there are plans to have OpenID authentication in the near future, too). No more different wiki user logins, now. You log in using your email address and password and changes to your main Forge account are synced with the wiki authentication system. Finally.

Site-wide Search


Finally, one thing the old Forge lacked is a way to search the entire Forge — projects, code snippets, worklogs, and wiki pages — through a single, unified search interface. Well, now there’s a new search interface for the entire site, which yields results of all categories of information. The screenshot below shows the search results for “backup”. The search is quite simple at the moment, so I have plans to integrate the Sphinx search engine into the Forge to provide better search capabilities in the future. It’s one of the reasons I’m looking forward to the MySQL Conference. There is a session and a Birds of a Feather session on Sphinx that I plan on going to in order to learn how best to integrate Sphinx into the Forge.

So Help Us Test the New Forge!

We need help in testing the new Forge! Currently, the new Forge is running on a test database (taken from a dump of the real one today). Feel free to hammer it, crack it, add stuff, log in, log out, and all the other things you’d do on the normal forge and all the new things you can do on the new one! Report any bugs to http://bugs.mysql.com. There is a bug category for “MySQL Forge”. Keep in mind there are a few known issues which we will correct as soon as possible, so check there before reporting any bugs! :)

Oh, and one final thing. I’ll be releasing the entire source code for MySQL Forge and the underlying library called HiPPO this week on a bzr repository hosted on…what else? MySQL Forge. How sweetly recursive.

It’s About the Product, Silly

Today there was a recent flurry of blog posts, starting with Charles Babcock’s interview of Jonathan Schwartz about Sun’s strategy of targeting Web 2.0 developers. This brought to light an interesting topic about open source development communities, the perceived insularity of Sun towards the external OpenSolaris developer community, and why Linux will apparently always be more popular and technically stronger than OpenSolaris.

The initial interview led Amanda McPherson of the Linux Foundation to take issue, and long comments on those posts from supporters and objectors, shed light on a rift between OpenSolaris insiders, Linux community developers, and Sun’s overall marketing approach around open source.

Amid various snits about Sun’s “fluffy marketing” practices[1] towards OpenSolaris, there was a defection of a high-level OpenSolaris developer, Roy Fielding. The defection ostensibly was due to Roy’s frustration that Sun was not living up to its promises of a truly open development model and community, a frustration that Steven O’Grady finds uninspiring.

Unfortunately, these conversations are really just a sideshow of personal and anti-corporate rhetoric that miss the underlying truth about open source (and indeed closed source) products/projects: it’s not about the marketing, it’s about the product, silly.

I’m sure my colleagues of the marketing persuasion may take issue with that statement, so let me elaborate why I believe this to be true, and why I think that OpenSolaris has a chance of becoming a true competitor to Linux in the future — and that chance has absolutely nothing to do with Sun’s OpenSolaris marketing team (sorry, folks.).

It’s All About the Product (Quality)

As much as developer communities[2] like to think of themselves as the purveyors of a grand vision of truth towards the user community, all developer communities are dwarfed by their corresponding user communities. Members of a user community are unlikely to be aware of discussions within the developer community nor care about the debates within that community.

User communities use the product. It is the product which counts, and nothing more. And the user community will continue to use a product as long as other competing products are inferior in three basic ways:

  • Ease of use
  • Performance
  • Stability/reliability

Before you say “whoah, doesn’t licensing count, too?!”. Sure, it does, but I group licensing in with “Ease of Use”. For some groups, a BSD-style license is “easier to use” because of a perceived viral effect of the GPL on their own code (especially for embedded products). For other groups, the GPL-style licenses are easier to use because they provide certain safeguards and benefits for their own products or environments.

So, my point is that it is not the marketing of a product that counts — indeed, false marketing statements or tactics can backfire quickly. What counts is the product itself. Marketing should be the function of promoting the innovative advantages of a product in the above three categories.

Likewise, developer communities should be about improving a product in these three ways. And, NO, “features” is not a differentiation that the user community (or customers) use in picking one competing product over another. Features that deliver better ease of use, faster performance, or more stability are what can drive differentiation. Features that do none of the above are a waste of space.

Why Open Development Models Improve Product Quality

An open development model and community is simply a method of attaining better product quality. Sure, happy developer communities translate to happy user communities — happy developers make better code and better code is enjoyed by users. But simply having a “community”, as the OpenSolaris marketing team would have you believe, does not magically make a good product. A community can take many shapes. It is my belief that a more open development community does make a better product. Why?

Increased Modularization of the Core Product

In my estimation, there are many reasons for open communities producing better products than closed communities. First and foremost amongst these reasons is the first evolutionary process which occurs when development of a product is opened up in any way: increased modularization and pluggability of the core product.

The vast majority of open source projects which have a truly open development community exhibit much higher architectural modularization than projects with closed development communities. Apache, Eclipse, Linux, PHP — these are all projects with open development communities and they each exhibit extremely high levels of modularity. Why? Because modularization enables development to continue/begin on a specific feature or part of the codebase without affecting or hindering the development of the core of the product. This enables all sorts of things from easier versioning and release practices to an increase in external development of needed features.

Contrast the architecture of these projects with the architecture of a closed development project such as MySQL. The architectural differences are sometimes stunning, especially in early versions of MySQL. When I wrote the System Internals chapter in Pro MySQL in 2005, I wrote:

MySQL’s architecture consists of a web of interrelated function sets, which work together to
fulfill the various needs of the database server. A number of authors have implied that these
function sets are indeed components, or entirely encapsulated packages; however, there is
little evidence in the source code that this is the case.

I wrote that chapter against the MySQL 5.0.2 code base. Admittedly, lots of improvements have made it into MySQL since those days, but much of the same interweaving of subsystems still exists, and complicates the code base substantially. I believe that if an open development model had been instituted at MySQL years ago, we would have a very different and much more modular core database kernel than we have today. Heck, the embedded library might even work correctly…

Modularization Leads Directly To Innovation

So, besides enabling cleaner release management and increased development of external features, why is a modular product better than one that isn’t? This all goes back to the “Big Three Basics”: Ease of use, performance, and stability. Let’s tackle the last one first.

Increased Stability

Why would modularity lead to an increase in stability? This can be answered quite easily. When a product’s architecture undergoes the refactoring needed to modularize, two things happen. First, a “core kernel” or runtime begins to take shape. This kernel takes shape because developers in an open development community want to be able to develop against a firm set of application programming interfaces (APIs). Programming against an API enables a developer of an extending feature to form a contract with other developers in the community: “I will write my feature to connect to other stuff in a predefined and contractual way”.

Secondly, this method &mdash programming against an API — allows the developer to develop her feature in isolation from ongoing development in other areas. Development in other areas can count on other developers’ code not interfering with their own code. In addition, work done on one feature or module can be released irrespective of the hangups or problems occurring in other parts of the code.

So why does this lead to greater product stability and reliability? Because of the need for APIs (so that other developers can code modules outside of the interference of other modules) a set of APIs begins to emerge that outline the “black box” of the core product. This standardization process of creating APIs for connecting to the core kernel acts as a stabilizing force for the product in general. The stability of the APIs, and their consistency in implementation, can be directly correlated to the stability of the underlying product as a whole. The reason for the correlation is that when changes to an API are more difficult to effect &mdash because of their corresponding effects on other code — stability increases due to fewer changes in the way that components interact with each other.

Case in point: the “pluggable” storage engine API in MySQL was introduced in early MySQL 5.0 versions. This API has changed numerous times over the past two years as different vendors (internal and external to MySQL) needed or wanted additional functionality. The API has started to stabilize now, but I see additional changes in the future. What has happened to the stability of the underlying code base of MySQL? I will leave that answer to the reader’s intuition.

Increased Performance

The reason I believe modularization leads to better performing software is because the act of modularizing identifies pieces of the software which can and should be “pulled from the core”. Pieces not central to the functioning of the core kernel are pulled out of the core product and placed, rightfully, into modules which provide additional, value-added-for-a-certain-group-which-needs-it features. By placing extra features outside of the realm of the “necessary”, the core kernel begins to take the shape of a lithe marathon runner and not the bloated monolith it once was.

You may wonder why this removal of extra features can lead to better performance of the product as a whole. By removing non-essential parts of the software into modules, the code for the core part of the software is simplified and shortened. Now, simple, shorter code isn’t necessarily faster just for being shorter (though often it can be). However, the ease of maintenance on a simplified core software kernel allows developers to focus on performance-related tasks involving that simplified kernel. Improving performance of essential runtime functionality is much easier and faster when the code isn’t littered with calls and code paths that correspond to the functionality that belongs in an external module. If performance issues in the core kernel are easier and faster to fix, then the product as a whole becomes better performing.

Secondly, the modularization of a code base means that users who do not need an array of features can use a smaller, streamlined product that only fits their needs. Often, when users can make use of a streamlined project that doesn’t contain code they don’t need or want, the resulting binary is smaller and faster.

Increased Ease of Use

How modularization increases ease of use is related to how it increases performance: by enabling developers to only focus on the module at hand, or the core itself, instead of a mix of both, the developer is freed to focus on fixing ease of use bugs and addressing usability issues. The more time a developer can spend actually doing the things that comprise her outstanding bug and feature list, and not on “how will my code affect ten other developer’s code”, the better the chances of a product’s usability becoming better.

Finally, Increased Competition Fosters Innovation

I think it’s fairly obvious that increases in stability, performance and ease of use translate into increases in innovation for the product as a whole.

My final thought on why modularization leads to increased innovation is that modularizing a product leads to competition in the developer marketplace for a specific feature. This competition spurs the developers of a competing module to do it better and faster. Case in point: the pluggable storage engine API “modularization” effort at MySQL spurred the creation of numerous competitive engines — from Paul McCullagh’s PBXT engine to SolidDB’s transactional engine to the new Falcon and Maria engines. Each engine demonstrates different characteristics, benefits and disadvantages. Without each other, the wellspring of innovation would be much drier.

Innovation Leads to Market Share

So, assuming you’ve come with me in my theory that open development communities foster modularization of a product, and that this modularization leads to more innovation for the product as a whole, then I think you will be able to make this final step easily: innovation in a product leads to greater market share.

It doesn’t matter whether the product is the result of a group of loosely-affiliated individuals or a company like Sun or MySQL, each producer of a product wants their product to have the most market share compared to competing products. It’s human nature; we just want to be popular!

Jonathan Schwartz is Right About Community, But Wrong About Why He’s Right

So, in Babcock’s interview, Jonathan Schwartz says,

Everything begins with the development of a community

Jonathan is right about everything beginning with the development of a community. However, I believe Jonathan, and a number of folks I’ve met personally at Sun, feel that simply having a community automatically makes a good and popular product. This simply isn’t true. It’s about the product, first. Then the community. A vibrant community (both developer and user) springs from the fountain of enthusiasm about a product. If that enthusiasm wanes, because of anything from the closing of the development model to the inability to get their voices heard regarding a product roadmap or architecture, then the community begins to die along with the product. It is a delicate thing, this product and community balance sheet.

This well is poisoned; the company has consumed its own future
and any pretense that the projects will ever govern themselves
(as opposed to being governed by whatever pointy-haired boss
is hiding behind the scenes) is now a joke. Sun should move on,
dissolve the charter that it currently ignores, and adopt the
governing style of MySQL. That company doesn’t pretend to let
their community participate in decisions, and yet they still
manage to satisfy most of their users. Let everyone else go
back to writing code/documentation for hire.



There’s nothing particularly wrong with that choice — it is
a perfectly valid open source model for corporations that
don’t need active community participation. IMO, the resulting
code tends to suck a lot more than community-driven projects,
but it is still open source.

The above is a quote from Roy Fielding’s email to the OpenSolaris developer’s mailing list in which, at the end, he resigns from the OpenSolaris community. I think it is telling about how delicate the balance between product and community really is. He points out that it was the broken promises about a truly open development model that was the deciding factor in his resignation. I suspect that frustration at not having a voice in the architectural decisions surrounding OpenSolaris were also a big factor, too. Pointedly, he states that MySQL has never pretended to let our community participate in decisions, but MySQL still enjoys a large and vibrant user community.

I theorize that the MySQL user community is in danger of becoming fragmented and is increasingly fragile because of the “we decide” nature of MySQL’s closed development model. I believe that MySQL achieved ubiquity and a huge user community because of this:

  1. In the beginning, it “just worked”, was fast, reliable, and easy to use.
  2. It became ubiquitous in both language packaging and Linux distributions very early on and only because of #1 above

In other words, it was about the product itself.

If the product loses touch with its roots and because of a closed development model moves further away from easy to use, fast, and stable, the user community and MySQL’s ubiquitous nature, will move away to competitors.

Again, an open development community is a method to achieve greater innovation in a product. I believe it is silly for either MySQL or Sun to make any steps that do not open up its development activity to the wider community. If openness spurs innovation of the product, closing off of development only means that the company is stifling innovation and decreasing its own market share. If a method is there to increase innovation, why disregard it?

Why OpenSolaris Could Be a Real Competitor to Linux

It’s about Innovation in the Performance, Stability and Ease of Use of a product. As Ian Murdock wrote on the OpenSolaris blog:

My basic observation, as someone
who came into the OpenSolaris community from the outside – even perhaps
from the competition – and who represents the target market this
community needs to reach was this: That the packaging and presentation
of OpenSolaris as it stands today represents a barrier to adoption and,
thus, an obstacle to growing the OpenSolaris community and bringing in
new users. To lower these barriers, OpenSolaris needs to be more than
just the code base. It needs to be a binary that users can easily
download and install to get easy access to OpenSolaris technology. Put
another way, as I said in a blog post in June, we need to have a
better answer to the question, “Where do I download OpenSolaris?”

Ian understands that a product gets adoption when innovations in ease of use, performance, and stability are demonstrated. In the above quote, he talks about the problems in ease of use that are barriers to adoption of OpenSolaris.

Similarly, there are a number of performance-related and stability-related innovations that are contained within OpenSolaris. But the product will not achieve true competitive status with Linux until the development community is opened up and innovation from the external community — in the form of a voice and control over architectural issues and design — is truly embraced at Sun.

If Sun embraces an open development model and embraces its developer community in discussions on roadmap and architectural decisions, I think OpenSolaris has an excellent chance of competing head-to-head with Linux on the innovation front. If the product is more innovative on a variety of levels, it will gain market share.

What This Means for MySQL

As Brian Aker rightly points out, MySQL has never been an open development model. But steps have been taken to open it up. The resignation of Fielding suggests that it’s not the lack of openness which frustrated him, but the promises of Sun to do so. This means that MySQL must be very careful not to over-promise on something it cannot or will not deliver regarding openness.

In addition, if we are to reverse the current course of a community-in-flux, we must embrace the fear we have of opening up our development process. We must get out of the cathedral and put up shop in the bazaar. We won’t be jumping out of the cathedral’s top window; more likely we’ll rappel down Rapunzel’s hair slowly and carefully. I see very good prospects about opening up in the future and I am excited. It will be happening in the nick of time.


Take everything I write with a grain of salt. After all, I’m just a lowly community relations manager.

Footnotes

[1] Amanda McPherson writes:

We may not have fancy Linux analyst days and mountains of spin, but it is all about the development community. Literally.

to which Mike Dolan responded in comments:

Yes, yes, and yes. This Sun nonsense needs to be called out; it’s great to see someone else actually looking into the critical details behind Sun’s fluffy marketing.

[2] I differentiate “developer community” as the developers which develop the product, as different from the “user community” which a) develops products on top of or for the product or b) simply use the product.

Answers to Questions from Today’s Performance Coding Webinar

Thanks to everyone who attended the performance coding webinar today. I think there were about 250 people online, which is a great turnout! Sorry for dropping off about fifteen minutes into the webinar — that’ll teach me to use Skype for presenting!

There were a bunch of questions from attendees and I have tried to quote them and answer to the best of my knowledge below. They are in no particular order, and where possible, I refer to the slide where I think the question referred to. Happy tuning and coding!

Miscellaneous Questions

Will MySQL Support Functional Indexes Soon?

This comes up every time! :) Not sure why…perhaps there’s lots of Oracle and DB2 users also using MySQL and wanting to migrate to MySQL without changing schemas. In any case, the short answer is no, they won’t. There is a Worklog task that describes the status of this feature request. Feel free to comment, as others have on it.

Questions on Data Types

The following questions concerned the slides which talked about data types and schema considerations

What About Using VARCHAR With a Specified Length? Is That Better?

Also:

Could You Explain Again About Created_tmp_disk_tables?

So, this question arose around the problem inherent with the VARCHAR data type when a VARCHAR column is used by the optimizer to satisfy a SELECT in which a GROUP BY or ORDER BY condition cannot be satisfied by an index. In these cases, a temporary table of the MEMORY storage engine type is created and records are pulled from the original storage engines into this temporary table and then it is sorted and grouped.

The problem with that is that the MEMORY storage engine does not support variable-length column data types. So, a VARCHAR(255) column in the permanent table which stores a 5 byte string is converted into a fixed-length 255-byte (assuming a Latin1 character set) field. If your average size of data stored in your VARCHAR(255) column is around 30 bytes and you’ve got around 1000 rows in the winnowed result set (after a WHERE condition), that works out to around 3K of memory used for that VARCHAR field. However, if a temporary table is needed to re-sort or group that intermediate set, that 3K balloons into 25K because the variable length data just got converted into fixed-length data.

Now, 1000 rows in an intermediate result set (before the extra grouping or ordering) is pretty small. If you’ve got a larger intermediate set, the variable to fixed-length conversion may easily push the temporary table memory over the max_heap_table_size or tmp_table_size configuration variables. If that happens, MySQL will convert the MEMORY table into a MyISAM disk-based table. When this happens, the Created_tmp_disk_tables status counter increases and MySQL will be doing a lot of work because of the conversion. So, this is why I am always wary when I see tables with 10-20 VARCHAR(255) columns in them (Yes, Ruby on Rails, I’m talking to you! :) )

Other reason why you might see the Created_tmp_disk_tables is if intermediate grouping/sorting results contain any TEXT or BLOB columns. The MEMORY storage engine does not support these data types and so a MyISAM temporary table will always be used instead. To demonstrate this behaviour, take a look at the following output from the mysql client:

mysql> CREATE TABLE test_mem_to_disk (
    -> id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
    -> , name VARCHAR(30) NOT NULL
    -> , description TEXT NULL
    -> ) ENGINE=MyISAM;
Query OK, 0 rows affected (0.04 sec)

mysql> INSERT INTO test_mem_to_disk VALUES
    -> (NULL, 'Jay Pipes', 'Community Relations dude at MySQL')
    -> , (NULL, 'Roland Bouman', 'Certifications dude at MySQL')
    -> , (NULL, 'Baron Schwartz', 'Community Extraordinaire');
Query OK, 3 rows affected (0.01 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> FLUSH STATUS;
Query OK, 0 rows affected (0.00 sec)

mysql> EXPLAIN SELECT id, name FROM test_mem_to_disk GROUP BY LEFT(name,1)\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: test_mem_to_disk
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 3
        Extra: Using temporary; Using filesort
1 row in set (0.00 sec)

mysql> SELECT id, name FROM test_mem_to_disk GROUP BY LEFT(name,1)\G
...
3 rows in set (0.00 sec)

mysql> SHOW STATUS LIKE 'Created_tmp%';
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| Created_tmp_disk_tables | 0     |
| Created_tmp_files       | 0     |
| Created_tmp_tables      | 2     |
+-------------------------+-------+
3 rows in set (0.00 sec)

I went ahead and did a GROUP BY LEFT(name, 1) to simulate a situation where a grouping is done on a non-indexed column or that an index is available but isn’t suitable for the GROUP BY. As you can see in the EXPLAIN output, the Extra column shows “Using temporary; Using filesort” which indicates to us that MySQL is creating an temporary space to deal with the intermediate result returned from the MyISAM base table (in order to do the weird grouping).

You will notice that the Created_tmp_tables counter variable increased by 2. One of these increments is due to the SHOW STATUS command itself (a little bit of a pet peeve of mine). So, you can view the number above as producing one MEMORY temporary table to deal with the intermediate grouping. Notice that above, we do not include our TEXT column (description) in the output. What happens when we include the description column in the SELECT output? See below:

mysql> FLUSH STATUS;
Query OK, 0 rows affected (0.00 sec)

mysql> EXPLAIN SELECT id, name, description FROM test_mem_to_disk GROUP BY LEFT(name,1)\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: test_mem_to_disk
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 3
        Extra: Using temporary; Using filesort
1 row in set (0.00 sec)

mysql> SELECT id, name, description FROM test_mem_to_disk GROUP BY LEFT(name,1)\G
...
3 rows in set (0.00 sec)

mysql> SHOW STATUS LIKE 'Created_tmp%';
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| Created_tmp_disk_tables | 1     |
| Created_tmp_files       | 0     |
| Created_tmp_tables      | 2     |
+-------------------------+-------+
3 rows in set (0.00 sec)

As you can see, just including a TEXT (or BLOB) field in the SELECT output forces a MyISAM disk-based table. Also notice that a MEMORY temporary table is first created, then a flush to a disk-based table is done after the TEXT column is noticed.. You have been warned.

Why Use INT UNSIGNED [for IPv4 addresses] and not INT SIGNED

Always use INT UNSIGNED as the data type for IPv4 addresses. Not CHAR(15). Not VARCHAR(15). And not INT (without UNSIGNED). The INET_ATON() function returns an UNSIGNED integer, not a signed one, as IPv4 addresses need the full positive addressing of the unsigned integer. I know this partly because I worked on the bug that fixed the INET_ATON() function which had previously been returning signed integers…

How Do You Recommend Schema Changes to Go from IPv4 Addressing Using INT UNSIGNED to IPv6 Addressing?

Brian Aker is adding support for a native IPv6 data type into MySQL 6.0. So that should be the version in the future that you use for IPv6 addresses. A version right now? Not sure… CHAR(32)? :)

Questions on the Query Cache

The following were questions asked when discussing the slides on vertical partitioning, specifically to address problems with Query Cache thrashing.

Does the Query Cache Ignore Things Like LIMIT and ORDERing?

No, it does not. Every SELECT expression, including differences in case-sensitivity, character sets, constants used, table aliases, LIMIT variables, and ORDERing will produce a different stored MYSQL_RESULT store within the Query Cache.

Can We Prevent Data From Getting Into the Query Cache?

Yes, definitely. You can:

  • use SELECT SQL_NO_CACHE ... . The query cache will ignore any SELECT with SQL_NO_CACHE in it.
  • Turn it off entirely for the server by setting the configuration variable query_cache_size equal to 0 in your my.cnf

Does the Query Cache Ignore “ORDER BY RAND()”?

Yes, it does because RAND() is a non-deterministic function. If the SELECT contains any non-deterministic function, the query cache will ignore it. Common non-deterministic functions include:

  • RAND()
  • CURRENT_DATE()
  • NOW()
  • CURRENT_USER()
  • SYSDATE()

Can We Turn Off Query Cache Per Schema Basis?

No, unfortunately this is not currently possible. Only per-server or per-statement as above.


I’ve got load more questions and answers, and I will post them all tomorrow when I get some more time.. :)

Just Chill…Chilll Out, OK? There Ain’t No Devil in PDOv2

A number of people have emailed me wondering why I haven’t blogged about the Sun/MySQL deal. Well, I’m still working out my thoughts on that, so I’ll leave it to another day. Besides, haven’t there been enough blog posts about it already?! ;)

As a PHP community member and a person who has been participating on MySQL’s behalf in the much-maligned PDOv2 working group, there is a more important and pressing topic of conversation that I’d like to comment on. Namely, the recent events surrounding the publication of the FAQ about PDOv2. There are many different topics being bandied around the PHP community schoolyard — some on-topic, some wildly off-topic and tangential. These are the issues I think represent what the majority of conversations have been about:

  • WTF? Who is this private, clandestine, devil-worshipping PDOv2 working group anyway and why wasn’t the PHP community involved in their discussions?!
  • WTF? What is this proposed Contributor License Agreement and why do we have to go over this all over again?!
  • WTF? This is only going to fracture the community into two diabolically and diametrically opposed groups!
  • WTF? This is just the database vendors trying to market their own agendas and push PHP in their own directions!

A number of emotional responses to the issues have already been made, but I write here to make a plea for calm, mature discussion about the topics at hand. It serves little purpose to be dogmatic or reactionary about this stuff, and I hope in this entry to make the case for rational, open discussion from here forward.

About the Devil-Worshipping Working Group

There have been a number of comments about how the working group composed of representatives from the database vendors, Zend and developers of PDOv1 have met in this clandestine, non-public way, and how this secretive meeting is against the principles of the PHP community, and worse, open source in general. Here is my response to these comments: Get over it.

As developers and users of PHP, we live in a world of both individuals and corporations. As such, both sides must recognize the different needs and desires of each group. The needs of large corporations and of individuals sometimes differ, and the ways in which those needs are met are often different.

One example of how those varying needs are different is the process by which legal discussions can take place. Large corporations, and their legal teams, need a different venue in which to discuss their common legal concerns — a venue which in large part must be free of the distraction of long public mailing lists. Often, the legal issues discussed in such venues are private and confidential to the companies. For instance, lawyers from various companies must be free to discuss common concerns that affect them and not the wider public audience before their stance on an issue is made public.

The discussions of the working group members up until this point have been of this nature — discussions of common concerns and agreements regarding legal ramifications of their contributions to the PDO project. It made no sense to open up these discussions on a wider mailing list. Now that such discussions are over, the working group has put together an FAQ which attempts to explain the group’s collective thoughts. The working group as it is now will cease to be the only ones discussing PDOv2 and will be just another voice in the PHP community as we address the issues surrounding CLAs, licenses, and such.

So, in short, if you are angry about not being included in the working group discussions to date: get over it and join the ongoing discussion in a useful way.

About the Need for a New PDO

Although Pierre and others have issued an emotional, albeit non-substantive, plea to ignore PDOv2, there are numerous reasons why I believe development of PDOv2 is needed, and why open discussions about the issues should proceed in a mature way.

Lack of Contributions to PDOv1

Although some in the community seem to think that community outside of contributors from database vendors would do a better job at maintaining and enhancing PDO, the evidence does not support this claim. I call out Pierre’s comment on Antony’s blog which says:

However you are definitively right, PDO needs love but will it be enough to convince their original authors to bring it at another level? to something we really need? :-)

In a perfect world, there would many contributors in the PHP development community who would be both willing and able to “show PDO some love”. But in the real world, such people have not come out of the woodwork. In fact, the developers who understand the various database vendor APIs work for the database vendors. So, if we want an improved, standardized PDO that is test-covered, the most pragmatic solution is to find a way to have employees at the database vendors able to contribute. We must begin to live in the real world, and not in a fantasy land.

PHP community members should be excited about the database vendors desire to contribute to an improved PDO, not lambasting them. Another quote from Pierre:

The biggest mistake in PDO (and the biggest mistake is being repeated again by the same persons and some newcomers) was to think they know better. They have a good understandings of DBs but sadly not of what many of us were actually looking for or about our needs.

If we could return to reality and get past strict ideology, I think you will find that the database vendors are begging for the chance to have the PHP community tell us precisely what it wants. Sure, we must get over this initial hurdle regarding the details of if, when, and where a CLA should be used, but once that is past us, we, the database vendors are eagerly anticipating the ensuing enhancements to be made to both the PDO core and our own drivers!

About the Need for Standards in PDOv2

Another reason we must press on and solve the philosophical and legal issues at hand is the burning need for database vendors to work with the PHP community on standardizing the currently messy way in which each vendor driver exposes, transfers, and writes data to and from its database. If we continue down the path of ideological crusade, we simply prolong the standardization work.

About the Need for a Metadata Interface

Let’s face it. Compared to JDBC/ODBC, the ability of PHP developers to retrieve database metadata — be it schematic, columnar, or tabular — is really poor. The reasons for this are mostly to do with the underlying drivers for each vendor. If the database vendors are locked out of participating in the development of PDOv2, expect to see little change in this regard. We need the participation of all parties to resolve differences and hammer out the kinks in the underlying libraries.

It Ain’t About Open Source, Silly

As some ideologues can tend to do, some valid concerns are often stretched beyond their limits. The current concerns about the Contributor License Agreement abolishing open source within PHP is a perfect example of this. Richard Thomas writes on Mike Willbanks’ blog:

who cares if a CLA is worthless if it makes vendors feel warm inside and gets them to contribute?

to which Larry Garfield responded:

I care, because with a CLA it’s not Free Software anymore. I’m not allowed to look at the code unless I sign an extra agreement that I won’t do… something. Sorry, not Free Software. I’m not interested.

This simply is wrong and misses the concept of a contributor license agreement.
Luckily, Mike responded with a poignant reminder to not stretch the boundaries of an argument beyond their natural conclusions:

To clarify a few things. A CLA does not mean that it would be closed source and you would have to sign a CLA before hand. Take a look at the Zend Framework for instance. You have to sign a CLA before you contribute but that certainly doesn’t make it so you can not view the source code.



Further, if the source code was not available then how would we build from the source? I think you are looking at this from the wrong viewpoint without actually reading into it fully.

Lukas Smith responded with some more common sense:

Indeed, the concern with CLA is mostly about the ability to openly participate and not about being able to read the source.

In Summary

Let us, the PHP community, the database vendors, the core developers of PHP, stay on track. The potential of PDOv2 is worth more than ideological bickering: it benefits the entire substrate of the PHP user community that we improve and enhance PDO. Let us march forward towards that goal. Yes, there are issues to be resolved — perhaps most importantly, the code and spec boundaries to which a CLA will bind — but the future is bright for data access through PDO. Let us not darken the skies with rhetoric and ideology. Let us come together and make something that is greater than the sum of its parts.

Slides from ZendCon Performance Coding Session


I presented at ZendCon this week on performance MySQL coding techniques. The session was packed, and I got a bunch of great questions from the audience. I figured I would post the slide deck in both ODP and PDF formats. Feel free to download!

Speaking at ZendCon – October 8-11 – San Francisco

src="http://s3.amazonaws.com/zendcon/ZendCon07_SpeakerBadge.gif" style="float: left; margin: 0px 20px 20px 0px;"
border="0">
I’ll be giving a 1 hour session at ZendCon this year entitled “MySQL Performance Coding – From Soup to Nuts”. The abstract says 3 hours, but I believe it’s a one hour slot… Either way, it’ll be fun and rowdy. The PHP crowd always is a fun crowd!