• Compressed File Systems on Linux

    Perhaps the title should have been ‘The lack of a suitable compressed file system on Linux’. A compressed file system in this case refers to a setup where files are saved on disk in a predefined compressed format (such as gzip or bzip2). When you read from those files they are automatically decompressed by the file system. Similarly, when you create a new file or modify an old one, it is automatically compressed before saving. Such a file system is sure to be very slow for random access, but for sequential access it wouldn’t matter so much. It might even be faster than an uncompressed file system, because hard drives continue to be the real bottleneck in most computers today.

    Linux gives you two options for creating file systems: at the kernel level or in user space. e2compr is a kernel patch that supports compressing an ext2 file system, while the user space tools are based on FUSE. Using kernel space drivers can be messy if the chosen file system, like e2compr, is always playing catch-up with the kernel version. The only compressed file systems supported in the mainline kernel are jffs2 and squashfs. The former is for flash devices and the latter is read only. Though JFS is available for Linux, its compression mode is not. While both KDE’s Dolphin and GNOME’s Nautilus support mounting a gzipped tar archive or a zip file as a node in the file system in read/write mode, they are not suitable for web apps. So FUSE seems the way to go.

    There are eight file systems listed on the FUSE compressed file system page. Four are read only, two are abandoned, two more are heading down that route. One is in its early development stage. That leaves FuseCompress.

    This one is really easy to install: just do ‘yum install fusecompress’. Setup is even easier:

    fusecompress  /mnt/compressed   /mnt/uncompressed

    Copying over a set of 30,000 files (a mix of text and binary data) with a total size of 85MB took 36 seconds. The same operation, when repeated, completed in 20 seconds because of file system caching. When the files were copied without compression the operation completed in 1 second. So much for my assumption that a compressed file system would be faster.
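
    A comparison like the one above can be repeated with a small timing harness. The following is only a sketch, not the script used for the numbers quoted here: the helper names (timed_copy, make_sample_tree) and the sample data are made up for illustration, and the destination would have to be pointed at the FuseCompress mount point for one run and at a plain directory for the other.

```python
import shutil
import tempfile
import time
from pathlib import Path


def timed_copy(src: Path, dst: Path) -> float:
    """Copy the tree at src to dst (dst must not exist yet); return seconds taken."""
    start = time.perf_counter()
    shutil.copytree(src, dst)
    return time.perf_counter() - start


def make_sample_tree(root: Path, count: int = 100, size: int = 1024) -> None:
    """Create `count` small files under root as stand-in test data (hypothetical)."""
    root.mkdir(parents=True, exist_ok=True)
    for i in range(count):
        (root / f"file_{i:05d}.txt").write_bytes(b"x" * size)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "src"
        make_sample_tree(src)
        # For a real comparison, replace this destination with a directory
        # under the compressed mount, then with one on an ordinary file system.
        elapsed = timed_copy(src, Path(tmp) / "dst")
        print(f"copied in {elapsed:.3f}s")
```

    Running each case twice helps separate the effect of file system caching from the cost of compression itself.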

    Tuesday, March 23rd, 2010 at 14:23
  • MySQL Sphinx Storage Engine

    I have blogged about Sphinx in the past, but that was about CMU Sphinx, a speech recognition engine. This Sphinx is different: it is a search engine, and a pretty good one too. The Sphinx search daemon has a MySQL-compatible API, which means you can use your trusty old mysql console or query browser to do a search. You can even use the PHP MySQL API if you want to, but there is no need, because PHP has support for the native Sphinx protocol. However, if your preferred programming language doesn’t include support for Sphinx, you can still use the MySQL-compatible mode.

    The Sphinx index isn’t really a MySQL database, so you cannot join it to your other tables. That’s where the Sphinx storage engine comes into the picture. It allows you to create a table in your existing database to represent the Sphinx index, which lets you use the Sphinx data in a join or subquery. You might ask, why go to all this trouble? Why not just use full text search? Full text search has its limitations, the most notable being how slow it is on a large table. A full text search on a table with 8 million rows (which is about as much as we have on Pitupasa.com at the moment) takes 10-15 seconds to complete. Sphinx, on the other hand, returns the results in a fraction of a second.

    At first glance, it seems that enabling the Sphinx storage engine is kid stuff. Just download the Sphinx and MySQL source tarballs, copy a set of files from Sphinx into the MySQL source tree, and run BUILD/autorun.sh followed by ./configure --with-plugins=sphinx (see full instructions at http://www.sphinxsearch.com/docs/current.html#sphinxse-mysql50 ). Except that it doesn’t work.

    configure error: unknown plugin: sphinx

    Yes, I know what I am doing. No, there weren’t any typos. I am not the first one to run into this, and the others who did don’t seem to have found a solution either. Then I came across another set of instructions at http://www.sphinxsearch.com/wiki/doku.php?id=sphinx_sphinxse_on_rhel where the instruction is to use ‘-with-plugin’, but shouldn’t that be ‘--with-plugins’? Since I had already tried ‘--with-plugins’, I tried ‘--with-plugin’ (note the missing s). This time configure didn’t complain about the missing sphinx plugin as above, but the sources in the sphinx folder still were not getting built. By the way, ‘configure --help’ shows the correct parameter to be --with-plugins. Either way, none of these options results in the Sphinx storage engine being built.

    The docs refer to a prepatched MySQL source tarball; unfortunately it appears to have been taken offline. There is a patch for MySQL 5.0.x, but I don’t want to use that because I want to be able to partition the table and indexes (a feature that is not available in MySQL 5.0.x). Oh well, looks like it’s time to look at mnoGoSearch.

    Sunday, March 14th, 2010 at 11:52
  • CentOS 5.4, PHP 5.3 and Harvard Referencing

    A couple of days back, we did an update to Deadlinedue, the Harvard reference generator. The moment the database was updated and the new code was put in place, PHP started segfaulting. It was time to decide whether to roll back or press forward. I chose the latter and it resulted in the site being offline for around an hour.

    It is not possible to assign complex types to nodes in /var/www/deadline/ISBN.php on line 480, referer: http://deadlinedue.com/index.php?lookfor=http%3A%2F%2Fraditha.com&find=find
    *** glibc detected *** /usr/sbin/httpd: double free or corruption (fasttop): 0x81cdfc78 ***

    PHP is pretty mature by now; it’s not something that you expect to segfault. So initially I thought the culprit would be APC, the Advanced PHP Cache. These PHP accelerators or caches are known to crash every once in a while, which can easily be fixed by removing the accelerator. In this case we could afford to do so, since we had upgraded the server very recently and optimized the code, which resulted in a speed boost and reduced memory usage. But I was barking up the wrong tree: disabling APC didn’t do any good. So I guess it’s time to upgrade PHP?

    Now you might ask, shouldn’t we have used the same version of PHP on our development and production servers to ensure that this sort of thing didn’t happen? Right you are, but who expects PHP of all things to crash like this? It’s really ridiculous, because Deadlinedue is hosted in the cloud, so we could easily have made a snapshot of it and started another server in less than 5 minutes. We could have tested with that guinea pig and then gone live, but no, I was too cocky.

    This is CentOS 5.4 and there are no RPMs available for PHP 5.3.1, so it was time to compile from scratch. That usually means you need to run ./configure about half a dozen times, fixing each of the missing dependencies it reports until it completes without error. Fortunately there is plenty of bandwidth on Amazon EC2 and the servers are blazing fast. Most deps can be installed in less time than it takes to type out the yum install command.

    After all this the problem still wasn’t solved. A lot of modules we needed, like JSON, DOM and heck, even MySQLi, were not getting loaded!

    PHP Warning:  PHP Startup: apc: Unable to initialize module\nModule compiled with module API=20050922\nPHP    compiled with module API=20090626\nThese options need to match\n in Unknown on line 0
    PHP Warning:  PHP Startup: dbase: Unable to initialize module\nModule compiled with module API=20050922\nPHP    compiled with module API=20090626\nThese options need to match\n in Unknown on line 0
    PHP Warning:  PHP Startup: dom: Unable to initialize module\nModule compiled with module API=20050922\nPHP    compiled with module API=20090626\nThese options need to match\n in Unknown on line 0
    PHP Warning:  PHP Startup: json: Unable to initialize module\nModule compiled with module API=20050922\nPHP    compiled with module API=20090626\nThese options need to match\n in Unknown on line 0
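
    Each of these warnings says that an extension built against module API 20050922 is being loaded by a PHP built with module API 20090626. When there are a lot of them, it helps to pull the module names and API numbers out of the log in one go. Here is a rough sketch of that, assuming the warnings appear one per line with the literal two-character "\n" sequences shown above; the function name find_api_mismatches is made up for this example.

```python
import re

# Matches PHP's "Unable to initialize module" startup warning. The log shows
# the message's line breaks as the literal characters backslash + n, so the
# pattern matches a backslash followed by "n" rather than a real newline.
MISMATCH = re.compile(
    r"PHP Startup: (?P<module>\w+): Unable to initialize module\\n"
    r"Module compiled with module API=(?P<module_api>\d+)\\n"
    r"PHP\s+compiled with module API=(?P<php_api>\d+)"
)


def find_api_mismatches(log_text):
    """Return (module, module_api, php_api) for every mismatch warning found."""
    return [
        (m["module"], m["module_api"], m["php_api"])
        for m in MISMATCH.finditer(log_text)
    ]


# Two of the warnings from the log above, kept as raw strings so that "\n"
# stays as two characters, exactly as it appears in the log.
SAMPLE_LOG = "\n".join([
    r"PHP Warning:  PHP Startup: apc: Unable to initialize module\nModule compiled with module API=20050922\nPHP    compiled with module API=20090626\nThese options need to match\n in Unknown on line 0",
    r"PHP Warning:  PHP Startup: json: Unable to initialize module\nModule compiled with module API=20050922\nPHP    compiled with module API=20090626\nThese options need to match\n in Unknown on line 0",
])

if __name__ == "__main__":
    for module, module_api, php_api in find_api_mismatches(SAMPLE_LOG):
        print(f"{module}: built for API {module_api}, PHP expects {php_api}")
```

    Every module this reports needs to be rebuilt against the new PHP, or removed from the configuration, before it will load.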

    Given enough time, I could have tracked down the causes and fixed these errors, but time was exactly what I did not have. So I looked around for RPMs from third party repositories that would update PHP to version 5.3.1. Fortunately there were some; the repo is called Web Tactic. Updating using the RPMs took just seconds and, good news, no more segmentation faults. The bad news is that APC is no longer available. I tried ‘yum install php-pecl-apc’ without success. Then I tried ‘pecl install apc’ and got the following error:

    /tmp/pear/temp/APC/php_apc.c: In function ‘zif_apc_compile_file’:
    /tmp/pear/temp/APC/php_apc.c:881: warning: unused variable ‘eg_class_table’
    /tmp/pear/temp/APC/php_apc.c:881: warning: unused variable ‘eg_function_table’
    /tmp/pear/temp/APC/php_apc.c: At top level:
    /tmp/pear/temp/APC/php_apc.c:959: error: duplicate ‘static’
    make: *** [php_apc.lo] Error 1
    ERROR: `make' failed

    Oh well, we can live without APC for the moment.

    Monday, March 1st, 2010 at 17:15
  • Bug in Twitter list get members?

    If you add yourself to a list and then call the get list members method you might be surprised by the result.

    [Screenshot: adding a user to a list]

    One of the data items returned by the get list id method is the number of members in your list. The get list members method returns 20 members at a time, and you need to call it multiple times with different cursor locations to retrieve the complete membership. When you count them and compare against the number returned by the get list id method, you will find that they do not match. Your own account is not part of the dataset that is returned.
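
    The counting described above boils down to following cursors until there are no more pages and then comparing totals. The sketch below shows that loop against a stand-in for the API, not the real Twitter client: fetch_page, make_fake_fetcher and the member names are all hypothetical, and the convention of starting at cursor -1 and stopping when the next cursor is 0 is an assumption built into the fake.

```python
def collect_members(fetch_page, start_cursor=-1):
    """Follow cursors until the service reports no further page (next cursor 0)."""
    members = []
    cursor = start_cursor
    while cursor != 0:
        page, cursor = fetch_page(cursor)
        members.extend(page)
    return members


def make_fake_fetcher(all_members, page_size=20):
    """Build a stand-in for the API that pages through a fixed list."""
    def fetch_page(cursor):
        # In this fake, a cursor is simply an offset into the list.
        start = 0 if cursor == -1 else cursor
        end = start + page_size
        next_cursor = end if end < len(all_members) else 0
        return all_members[start:end], next_cursor
    return fetch_page


def missing_from_listing(reported_count, members):
    """How many members the reported count claims beyond what paging returned."""
    return reported_count - len(members)


if __name__ == "__main__":
    # Pretend the list has 46 members including its owner, but the member
    # listing leaves the owner out, as described above.
    listed = [f"user{i}" for i in range(45)]
    members = collect_members(make_fake_fetcher(listed))
    print(len(members), "listed,", missing_from_listing(46, members), "missing")
```

    With a real client, only fetch_page would change; the loop and the comparison stay the same.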

    [Screenshot: the XML from the Twitter API call]

    By the time you are reading this post, the list might have changed. So I have saved a copy of the XML here. On the other hand when you access the list through twitter.com, you can see that I am included in my own list. In fact my screen name (e4c5) appears twice!

    [Screenshot: the Twitter list]

    The workaround, I suppose, is to not rely on the member count returned by the get list id method.

    Sunday, February 7th, 2010 at 07:02