I had previously made a couple of half-hearted attempts to learn the R language without much progress. Then, inspired by my own Python in 48 hours and Ruby in 24 hours efforts, I decided to go on a crash course in R today. My interest in R is mostly because we have a lot of data at road.lk that I would like to look at through different eyes. I also have access to a lot of scrabble tournament data that needs analysis. The latter can certainly be done most easily with Python, but it never does any harm to have a lot of different weapons in your armoury. So it's 7:31 in the morning of August 7th. Let's start.
The first stop was the Coursera R language course, but lectures aren't my thing. So the next stop was the Datacamp courses, but they seem to be too trivial. Can't blame them, because their target audience probably isn't hardcore programmers but data scientists. The next stop was the r-project documentation, but that's where I got lost in the maze. There were too many documents to choose from. So as of 5:49 am on the 9th of August, I haven't gone anywhere. The fact that I couldn't devote my full attention to R didn't help either. So this is going to need another effort next week.
Internet Scrabble Club is the cool place to play scrabble online. Some of the world's top players hang around at the ISC. Playing online doesn't mean you can play in the browser; rather, you need to download and install one of their clients. Fortunately they have clients available for Linux, Mac and Windows. The only trouble is that the Linux client freezes at the slightest excuse. You can never seem to play more than one move before it either disconnects or stops responding altogether. The obvious solution then is to try running the Windows client under Wine.
The Windows exe running under Wine doesn't even get as far as that. It fails at the login screen. The dialog box to enter the username and password is permanently damaged and wants repainting, but that never seems to happen. So it was time to look back at the Java version of wordbuff, and it wasn't long before I figured out that the problem was that OpenJDK and wordbuff don't see eye to eye. It runs easily with the Oracle JVM.
/usr/local/java/jdk1.7.0_65/bin/java -jar wordbiz.jar
OK, a few footnotes:
The client software is named wordbuff. Does that have anything to do with Derek McKenzie who runs the popular word-buff.com website?
Wordbuff seems to save the username and password in a clear text file named Config.
ISC seems to save the password in clear text on their server (if you try the password reset, they send you your old password back in a plain text email).
This blog now runs on Google App Engine with NDB serving as the storage backend. In the process of writing the (Python) code I ran into some complications regarding how NDB uses indexes. First we will consider a query on one of the WordPress tables in a MySQL database.
explain select * from wp_posts where post_type='post' and comment_count = 0;
explain select * from wp_posts where comment_count = 0 and post_type='post';
The comment_count column does not have an index, so the MySQL query optimizer is smart enough to figure out that the 'type_status_date' index is the right one to use for both queries.
+----+-------------+----------+------+------------------+------------------+---------+-------+------+------------------------------------+
| id | select_type | table    | type | possible_keys    | key              | key_len | ref   | rows | Extra                              |
+----+-------------+----------+------+------------------+------------------+---------+-------+------+------------------------------------+
|  1 | SIMPLE      | wp_posts | ref  | type_status_date | type_status_date | 62      | const |  974 | Using index condition; Using where |
+----+-------------+----------+------+------------------+------------------+---------+-------+------+------------------------------------+
1 row in set (0.38 sec)
Now what if you tried to do a similar query on NDB? You would need indexes on both columns to begin with, and you would also find that NDB expects two different indexes for these queries. One would take the form Index(comment_count, post_type) while the other would be Index(post_type, comment_count). Welcome to the world of NoSQL, where no one really cares too much about storage. But if you are a GAE user you should care, because you are paying for it as well as for the read and write operations to the data store. The more indexes you have, the higher the write cost. Unless you plan your queries judiciously you will find that the indexes soon balloon out of hand, and you are limited to 200 indexes max!
Flipped the switch. Finally. WordPress has been switched off, as I have threatened to do so many times before and in fact claimed to have done at least twice! But this time it's real. This blog post is being written with the aid of a homemade blogging system running on Google App Engine (Python). The data is stored in NDB and the editor is CKEditor, with the comments being powered by Disqus.
So what about the previous claim about switching to Jekyll? Well, I did do so for the photoblog and I did do all the hard work for this one as well, but then for no apparent reason the whole damn thing stopped working.
I couldn't be bothered to find out what caused it, particularly since Ruby isn't in my repertoire. So I said the hell with it and stopped blogging for a few months, and then suddenly decided to come up with this system. It's said that the path to Django mastery lies through building your own blogging platform. Well, I can well and truly say that I have done that now.
The system hasn't been without a few teething problems. The feed link broke and resulted in a lot of people being sent a bunch of notifications about new posts even when there weren't any (very sorry about that). And there are still a few 404 errors scattered about here and there (these will be fixed soon).
Shut up and just give me the code? Ok, look at the bottom of this post.
What is the slowest query on road.lk? It's the select count(*) on the traffic alerts table that Django admin insists on executing each time someone visits the admin page for the alerts. Though we are fully operational only in Sri Lanka and Bangladesh at the moment, we have had Twitter stream parsers running for all countries for a few months. As a result, at the time of writing this blog post the traffic alerts table contains 762,779 entries. That's a number that Postgres can handle easily without breaking a sweat, as long as you don't do a select count(*) on it. In fact Postgres is not the only database that has trouble with these unqualified count queries. MySQL does even worse.
This cannot exactly be described as a bug, but it certainly is an issue that has been plaguing Django for five years. On the bug tracker it's perhaps rightly categorized as 'wontfix', because to fix it would be to break the paginator, which is used in most places. But this can be solved easily by caching the page nav tabs as a template fragment, right? Wrong! If you override the default change_list template, you will see that the paging section does appear to be cached, yet surprisingly, the debug toolbar and the Postgres slow query log reveal that the query is still being executed.
Looks like creating a custom paginator is the only way forward. In the end it turned out to be very easy. All that's needed is to grab the code for the default paginator and change just a few lines of code.
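The few changed lines themselves are not reproduced in this post. As a rough, framework-free sketch of one way to page without an unqualified count, the idea below fetches one row more than the page size and uses the extra row only to decide whether a next page exists. This is a different technique from patching the default paginator, and the class name CountlessPaginator and its page() method are hypothetical names made up for illustration, not part of Django.

```python
class CountlessPaginator:
    """Pages through a sliceable sequence without ever asking for its length.

    Hypothetical helper for illustration only; this is not Django's Paginator API.
    """

    def __init__(self, object_list, per_page):
        if per_page < 1:
            raise ValueError("per_page must be at least 1")
        self.object_list = object_list
        self.per_page = per_page

    def page(self, number):
        """Return (items, has_previous, has_next) for a 1-based page number."""
        if number < 1:
            raise ValueError("page numbers start at 1")
        start = (number - 1) * self.per_page
        # Ask for one extra row. Slicing a Django queryset this way is
        # translated into LIMIT/OFFSET, so no count(*) is issued.
        window = list(self.object_list[start:start + self.per_page + 1])
        has_next = len(window) > self.per_page
        return window[:self.per_page], number > 1, has_next


alerts = list(range(1, 24))  # stand-in for the traffic alerts table
paginator = CountlessPaginator(alerts, per_page=10)
items, has_previous, has_next = paginator.page(3)
print(items, has_previous, has_next)
```

The trade-off is that the page can only offer previous and next links, since the total number of pages is never known.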
This site does not use server side includes. Banish the thought! However this site does use PHP as if it was SSI. The site with the LAMP was first created in 2002; at that time CMSs weren't as advanced as they are now. In spite of the millions of hours that have been invested in CMSs by so many developers, they are still a load of crap, so you can imagine what it was like at that time (or you might have had the misfortune of using a CMS at the turn of the century). You can read my rant from 2005 here.
To cut a long story short, the part of the site that's not under WordPress is created by using PHP to echo out the headers and footers and whatnot. Which means there are a few lines of PHP code at the top of each page, a few lines at the bottom, and in between it's all HTML. Initially I toyed with the idea of using a regex to convert this mess to markdown, but regex sucks. Then it dawned on me that modifying the PHP code that acts as the template to generate markdown instead of HTML would be far easier, and that's exactly what I did.
Jekyll supports both pages and posts, so the non-blog part of the site can easily be converted to Jekyll pages. As to how to deal with the WordPress powered blog: well, dozens of experiments were done and it's all covered here in the archives.
Earlier I mentioned running across a plugin that did both categories and RSS. Turns out that the RSS generated is for the categories, which I am not using at the moment, and I am not sure if I want to add them either. Besides, Disqus comes with its own feed for each comment thread, so it's going to be very confusing for the visitor to figure out which RSS feed to subscribe to. The simple solution is to just place an RSS template file at the root level. Being lazy, I copied and pasted from this template.
Having done all this it was time to flip the switch - but on the photo blog. Yes, folks, all this time I have been concentrating on it because it's so much simpler and doesn't even have half the posts in this blog. The photo above is what it used to look like. Click through to the site, to see what it looks like now.
Windows has something called ReadyBoost - us Linux users have had half of what ReadyBoost provides for as long as flash drives have been in existence. That is, we have been able to use a flash drive as swap space for a long, long time, which is essentially what ReadyBoost does. But there is something more.
Part of ReadyBoost is the ability to store some of the frequently used data on disk on a flash drive. In other words, the flash drive acts as a disk buffer. Again this is nothing new; the Linux kernel automatically uses all available RAM as a disk buffer. When applications ask for more memory the buffers are reduced. MS DOS also used to have disk buffering systems, but they fell by the wayside when Windows came into being before making a comeback.
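To see how much RAM the kernel is using as a disk buffer at any given moment, one option is to read /proc/meminfo. The following is a small sketch, assuming a Linux system where that file has the usual "Name: value kB" lines; the function names parse_meminfo and disk_cache_kb are made up for illustration.

```python
import os


def parse_meminfo(text):
    """Turn the 'Name:   12345 kB' lines of /proc/meminfo into a dict of integers.

    Most entries are in kB; a few (such as the HugePages counters) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def disk_cache_kb(meminfo):
    """RAM held as block-device buffers plus page cache, in kB."""
    return meminfo.get("Buffers", 0) + meminfo.get("Cached", 0)


# Only attempt the real file where it exists (Linux).
if os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as handle:
        info = parse_meminfo(handle.read())
    print("buffers + cache: %d kB of %d kB total"
          % (disk_cache_kb(info), info.get("MemTotal", 0)))
```

Running it before and after starting a memory hungry application should show the buffers shrinking as the application's share grows.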
What the mainstream Linux kernel doesn't have is the ability to use a flash drive as a buffer. Fortunately there is a third party kernel module called dm-cache that does provide this facility. What's sad is that bureaucracy has kept it out of the mainstream kernel for close to two years.
The fact that the module hasn't been included in the main source tree means it hasn't been maintained. So the version that's downloadable from GitHub is compatible only with kernel versions up to 2.6.29. Let's see if there is a patch somewhere.