In order to tokenize all the files in the Delicious data set to facilitate searching, I came up with the following basic scheme:
- Decode all XML entities leaving only the actual character they represent (see html_entity_decode).
- Strip out all HTML tags (see strip_tags).
- Convert everything to lower case (see strtolower).
- Strip out all numbers using the \b\d+\b regular expression (see preg_replace).
- Remove all non-alphabetic characters left.
- Collapse multiple spaces into one space.
This strategy translates very cleanly into PHP as shown below:
class WordTokenizer
{
public static function tokenizeFile($file)
{
$contents = file_get_contents($file);
$contents = html_entity_decode($contents);
$contents = strtolower(strip_tags($contents));
// strip numbers
$contents = preg_replace('/\b\d+\b/', ' ', $contents);
// strip punctuation symbols & non-alphabetic symbols
$contents = preg_replace('/[^a-z]/', ' ', $contents);
// replace multiple spaces with just one
$contents = preg_replace('/\s+/', ' ', $contents);
return $contents;
}
}
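For readers who want to experiment with the scheme outside PHP, here is a rough Python sketch of the same steps. It is not part of the original code: the function name `tokenize_text` is made up for illustration, it takes a string rather than a file name, and the tag-stripping regular expression is only a crude stand-in for PHP's strip_tags.

```python
import html
import re


def tokenize_text(text):
    """Apply the tokenization steps described above to a string."""
    # Decode entities such as &amp; into the characters they represent.
    text = html.unescape(text)
    # Crude approximation of strip_tags: drop anything that looks like a tag.
    text = re.sub(r'<[^>]*>', '', text)
    text = text.lower()
    # Strip standalone numbers.
    text = re.sub(r'\b\d+\b', ' ', text)
    # Replace every remaining non-alphabetic character with a space.
    text = re.sub(r'[^a-z]', ' ', text)
    # Collapse runs of whitespace into a single space.
    text = re.sub(r'\s+', ' ', text)
    return text


sample_tokens = tokenize_text("<p>Hello &amp; welcome to 2010!</p>").split()
```

Note that the number pattern only matches standalone numbers. Digits embedded in a word (as in "mp3") are removed by the later non-alphabetic step, which turns "mp3" into "mp".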
The full source code is available in the information retrieval repository.
One of my assignments this past semester involved performing search operations on the tags supplied with the delicious data set. I created a database named “contentindex” and then a table for the tag information from the XML file using the following SQL:
CREATE TABLE `tag` (
`docname` VARCHAR(255) NOT NULL,
`tagname` VARCHAR(255) NOT NULL,
`weight` INT NOT NULL,
UNIQUE index_doc_tag_pair_weight (`docname`, `tagname`)
)
ENGINE = InnoDB;
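The effect of the unique index on (docname, tagname) can be tried out without a MySQL server. The sketch below is an assumption-laden stand-in: it uses Python's built-in sqlite3 module with an in-memory database instead of MySQL, and the helper names and sample rows are hypothetical. It only illustrates that a second row for the same document/tag pair is rejected.

```python
import sqlite3


def create_tag_table(conn):
    # Same columns as the MySQL table above; SQLite stands in for MySQL here.
    conn.execute(
        """CREATE TABLE tag (
               docname VARCHAR(255) NOT NULL,
               tagname VARCHAR(255) NOT NULL,
               weight  INT NOT NULL,
               UNIQUE (docname, tagname)
           )"""
    )


def insert_tag(conn, docname, tagname, weight):
    """Insert one row; return False if the (docname, tagname) pair exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO tag (docname, tagname, weight) VALUES (?, ?, ?)",
                (docname, tagname, weight),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
create_tag_table(conn)
first = insert_tag(conn, "doc1.html", "php", 3)      # hypothetical sample row
duplicate = insert_tag(conn, "doc1.html", "php", 5)  # same pair, rejected
row_count = conn.execute("SELECT COUNT(*) FROM tag").fetchone()[0]
```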
I set up an Eclipse Java project for the assignment. The MySQL Connector/J JDBC driver is required for interaction with the database. You will need to add the unzipped JAR file to the Delicious Eclipse project. To do so, right click on the project -> Build Path -> Add External Archives…
A list of XML parsers that could be used on the tag info XML file is available on the SAX website. I settled on the Xerces parser. Setup was a straightforward extraction into a folder, creation of a corresponding Java project in Eclipse, and finally adding the Xerces project to the Delicious Eclipse project’s build path – Right click on the project -> Build Path -> Configure Build Path…. -> “Projects” Tab -> Add…
The source code for the XML tag info parser is available on my github repo.
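To give a feel for what a SAX-style parser for the tag info does, here is a small Python sketch using the standard xml.sax module. It is not the Java/Xerces code from the repository, and the XML layout is purely an assumption: the element names (`taginfo`, `document`, `tag`) and attribute names (`name`, `weight`) are hypothetical and would need to be adapted to the real file.

```python
import xml.sax


class TagInfoHandler(xml.sax.ContentHandler):
    """Collects (docname, tagname, weight) rows from a hypothetical layout."""

    def __init__(self):
        super().__init__()
        self.current_doc = None
        self.rows = []

    def startElement(self, name, attrs):
        if name == "document":
            # Remember which document the following tags belong to.
            self.current_doc = attrs.get("name")
        elif name == "tag" and self.current_doc is not None:
            weight = int(attrs.get("weight", "0"))
            self.rows.append((self.current_doc, attrs.get("name"), weight))

    def endElement(self, name):
        if name == "document":
            self.current_doc = None


def parse_tag_info(xml_bytes):
    handler = TagInfoHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.rows


# Made-up sample input in the assumed layout.
SAMPLE = (
    b'<taginfo>'
    b'<document name="doc1.html">'
    b'<tag name="php" weight="3"/><tag name="search" weight="1"/>'
    b'</document>'
    b'<document name="doc2.html"><tag name="java" weight="2"/></document>'
    b'</taginfo>'
)
rows = parse_tag_info(SAMPLE)
```

Each collected row has the same shape as a row of the `tag` table, so it could be handed straight to an INSERT statement.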
Start with the CS462 AMI.
Edit multiverse.list
sudo vi /etc/apt/sources.list.d/multiverse.list
Add the following lines to multiverse.list:
deb http://us.ec2.archive.ubuntu.com/ubuntu/ karmic multiverse
deb-src http://us.ec2.archive.ubuntu.com/ubuntu/ karmic main
Then run the following commands:
sudo apt-get update
sudo apt-get install apache2
sudo apt-get install php5 php5-cli php-pear php5-gd php5-curl
sudo apt-get install libapache2-mod-php5
sudo apt-get install libapache2-mod-python
sudo apt-get install ec2-ami-tools
sudo apt-get install ec2-api-tools
sudo apt-get install python-cheetah
sudo apt-get install python-dev
sudo apt-get install python-setuptools
sudo apt-get install python-simplejson
sudo apt-get install python-pycurl
sudo apt-get install python-imaging
sudo apt-get install subversion
sudo apt-get install git-core
Note: the sun-java6-bin and libphp-cloudfusion packages are not strictly necessary (OpenJDK will be installed instead of the former, and the AWS PHP SDK instructions are given below instead of the latter). The unzip package could come in handy as well. The Python packages are installed up front so that Python web development is possible without having to install them after starting the server.
git config --global user.name "Johny Boy"
git config --global user.email johnyboy@gmail.com
sudo vi /etc/apache2/sites-available/default
Next, install Smarty as per the Smarty documentation (lines 1-3) and the Zend Framework as well (lines 4-7) since it may come in handy.
cd /usr/local/lib
sudo wget http://www.smarty.net/files/Smarty-3.0.7.tar.gz
sudo tar vxzf Smarty-3.0.7.tar.gz
cd /opt
sudo wget http://framework.zend.com/releases/ZendFramework-1.11.4/ZendFramework-1.11.4-minimal.tar.gz
sudo tar vxzf ZendFramework-1.11.4-minimal.tar.gz
sudo mv ZendFramework-1.11.4-minimal ZendFramework-1.11.4
Install System_Daemon as well to enable running PHP Daemons. There’s also a sample daemon illustrating how to use this class.
sudo pear install -f System_Daemon
Clone the AWS PHP SDK into /usr/share/php as documented in the “Getting Started with the AWS SDK for PHP” tutorial (lines 1-3) and then configure the SDK security credentials (lines 4-6).
sudo mkdir -p /usr/share/php
cd /usr/share/php
sudo git clone git://github.com/amazonwebservices/aws-sdk-for-php.git awsphpsdk
mkdir -p ~/.aws/sdk
cp /usr/share/php/awsphpsdk/config-sample.inc.php ~/.aws/sdk/config.inc.php
vi ~/.aws/sdk/config.inc.php
Now we can prepare to create the image and then run the ec2 commands to create, upload, and register the image. See the AMI tools reference for information about these commands. Of course, the actual access key, secret key, bucket names, etc. need to be replaced with the correct values.
cd /mnt
sudo mkdir image
sudo mv /home/ubuntu/PrivateKey.pem .
sudo mv /home/ubuntu/X509Cert.pem .
sudo ec2-bundle-vol -k PrivateKey.pem -c X509Cert.pem -u 999988887777 -d /mnt/image
sudo ec2-upload-bundle -b cs462-machines/mybucket -m /mnt/image/image.manifest.xml -a AKIADQKE4SARGYLE -s eW91dHViZS5jb20vd2F0Y2g/dj1SU3NKMTlzeTNKSQ==
ec2-register cs462-machines/mybucket/image.manifest.xml -K PrivateKey.pem -C X509Cert.pem
Once the process is complete, the instance can be launched with the following user data:
#! /bin/bash
sudo git clone git://github.com/pathtorepo/cs462.git /home/ubuntu/www > /home/ubuntu/gitclone.log
sudo chown -R ubuntu /home/ubuntu/www/
sudo chown nobody:nogroup /home/ubuntu/www/smarty/templates_c/
sudo chown nobody:nogroup /home/ubuntu/www/smarty/cache/
sudo chmod 770 /home/ubuntu/www/smarty/templates_c/
sudo chmod 770 /home/ubuntu/www/smarty/cache/
Note that the owner of the checked-out www folder is set to ubuntu so that files can be edited conveniently without sudo. The “nobody” user is then made the owner of the Smarty folders, which are assigned to the “nogroup” group, and their permissions are set to 770 so that only that owner and group can access them. I actually ended up using 777 to speed up development on my server. If nothing is displayed from the templates, check the Apache error log (most likely a case of permission errors).
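As a quick illustration of the difference between the two modes, the Python sketch below prints them in the familiar `ls -l` form and checks whether users outside the owner and group get write access. The helper names are made up. The likely reason 777 "fixes" template errors is that the web server process is neither the owner nor in the group, but that is an assumption about the server setup, not something this snippet can verify.

```python
import stat


def describe(mode):
    """Render a directory permission mode the way ls -l would show it."""
    return stat.filemode(stat.S_IFDIR | mode)


def other_can_write(mode):
    """True if users who are neither owner nor in the group may write."""
    return bool(mode & stat.S_IWOTH)


restricted = describe(0o770)  # 'drwxrwx---'
wide_open = describe(0o777)   # 'drwxrwxrwx'
```

With 770, a process running as any other user cannot write compiled templates into templates_c/, whereas 777 lets every user on the machine do so.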
Here are some options to include in the Apache configuration file:
DocumentRoot /home/ubuntu/www/htdocs
<Directory /home/ubuntu/www/htdocs/>
Options Indexes FollowSymLinks MultiViews
AllowOverride None
Order allow,deny
allow from all
DirectoryIndex index.php index.html index.py
AddHandler mod_python .py
AddHandler php5-script .php
PythonHandler mod_python.publisher
PythonDebug On
</Directory>
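With PythonHandler set to mod_python.publisher, a .py file under the document root exposes its functions as URLs. The file below is a minimal hypothetical example (the file name index.py and both function names are assumptions, not part of my application), meant only as a quick check that the handler is wired up.

```python
# index.py, a hypothetical file placed in /home/ubuntu/www/htdocs

def index(req):
    # Served for a request to /index.py; the publisher passes the request
    # object as "req" and sends the returned string as the response body.
    return "<html><body><p>Hello from mod_python</p></body></html>"


def greet(req, name="world"):
    # Served for /index.py/greet?name=Ada; the publisher maps query
    # parameters onto keyword arguments of the function.
    return "Hello, %s!" % name
```

If requesting /index.py shows the Python source instead of the greeting, the AddHandler line is probably not being applied to that directory.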
I ended up pushing my server configuration to the public git server that also holds my entire application. Server configuration is then reduced to:
sudo cp /home/ubuntu/www/serverconfig/apache/appserver/default /etc/apache2/sites-available/default
sudo apache2ctl restart
The Racket website has documentation on how to clone the PLT repository.
git clone git://git.racket-lang.org/plt.git
Next, the src/README file has all the gory details on how to build Racket. The procedure is rather straightforward for Linux:
mkdir build
cd build
../configure
make
make install
I have yet to figure out how to successfully build Racket in Visual C++ (2008 Professional), so that’s the next item on my list.
Update (03/11/11): This is actually rather straightforward for Visual C++ as well. Run vsvars32.bat to ensure that devenv and other commands are in the path, then run build:
cd plt\src\worksp\
"C:\Program Files (x86)\Microsoft Visual Studio 9.0\Common7\Tools\vsvars32.bat"
build
I got the hint from this thread.
Bug 498826 is about the HTML canvas putImageData method. It did not implement the optional arguments specified in the WHATWG spec. These optional arguments specify the dirty rectangle for the image data transfer (specifically, these arguments are the coordinates of the top left corner and the dimensions of the dirty rectangle, any of which are allowed to be negative). A quick glance at the description of the algorithm for handling of the optional arguments may not reveal the overall intent of the algorithm. Some of its key aspects are:
- Adjusting the dimensions of the rectangle to be positive by shifting the top left corner if necessary (step 2).
- Ensuring the top left corner of the dirty rectangle is in the first quadrant (step 3) which effectively eliminates all negative arguments.
- Clipping the dirty rectangle to ensure its lower right corner does not extend beyond the bounds of the incoming imagedata’s dimensions (step 4).
- Verifying that the newly adjusted dirty rectangle has positive dimensions (step 5), and if so, using the region bounded by the dirty rectangle on the incoming imagedata object as the source for the operation.
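The steps above can be sketched in a few lines of Python. This is only an illustration of the clipping logic as summarized here, not the C++ code from the patch; the function name `clip_dirty_rect` and the tuple return value are my own choices.

```python
def clip_dirty_rect(dirty_x, dirty_y, dirty_w, dirty_h, image_w, image_h):
    """Return the adjusted (x, y, width, height), or None if nothing is drawn."""
    # Step 2: make the dimensions positive by shifting the top left corner.
    if dirty_w < 0:
        dirty_x += dirty_w
        dirty_w = -dirty_w
    if dirty_h < 0:
        dirty_y += dirty_h
        dirty_h = -dirty_h

    # Step 3: move a negative corner to zero, shrinking the rectangle.
    if dirty_x < 0:
        dirty_w += dirty_x
        dirty_x = 0
    if dirty_y < 0:
        dirty_h += dirty_y
        dirty_y = 0

    # Step 4: clip against the dimensions of the incoming image data.
    if dirty_x + dirty_w > image_w:
        dirty_w = image_w - dirty_x
    if dirty_y + dirty_h > image_h:
        dirty_h = image_h - dirty_y

    # Step 5: a zero or negative size means there is nothing to draw.
    if dirty_w <= 0 or dirty_h <= 0:
        return None
    return (dirty_x, dirty_y, dirty_w, dirty_h)
```

For example, on a 10×10 image, a dirty rectangle at (8, 8) with width and height of -4 becomes (4, 4, 4, 4), while one starting at x = 20 lies entirely outside the image and yields None.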
The patch is rather straightforward (although admittedly, it was not as straightforward to create on my part). If there are enough arguments to specify a dirty rectangle, then the JS_ValueToECMAInt32 function is used to convert the JavaScript values into integers. The CheckedInt32 and the gfxRect classes do most of the heavy lifting in the patch, and then only the dirty rectangle is redrawn.
Bug 520914 is the only Bugzilla item I got to look at this month – since it required the least amount of time :). All that needed to be done was to remove the aPrompt argument currently in nsICookieService’s SetCookieString method. The locations in need of change were easily located with a simple MXR search.
While digging around in Bugzilla (as is now my usual daily custom), I came across Bug 580468 – JM: Tune Trace JIT Heuristics. It was interesting following the discussion since it was a perfect illustration of the principles being taught in CS470. Therefore, I wrote up this brief summary of the bug discussion to reinforce to myself how practical these issues are: Practical AI.
Having not used git that much, I cloned an ANTLR repository and then did my main development (and the associated commits) on the master branch. In order to submit my work, I needed to create one patch file with all the necessary changes. However, git format-patch was creating one .patch file for each commit. Some googling led me to Stack Overflow, where the basic outline of how to fix this situation was clear. Based on those suggestions, and the fact that there had been no remote changes that I needed to merge, I came up with the following, where 76b45 is the commit my work started from:
git checkout -b submission 76b45
git merge --squash master
git commit -a -m "Submission of work..."
git format-patch 76b45
In the rather straightforward bug 562433, Firefox’s location.host and location.hostname need to return the empty string for host-less URIs instead of throwing an exception. What ends up wasting my time with such 1-minute fixes is figuring out the right test location and the right command to run the test. At least documenting this should save a few minutes next time:
TEST_PATH=dom/tests/mochitest/bugs/test_bug562433.html make -C /c/mozilla/obj mochitest-plain
See the Mochitest automated testing framework documentation for details.
As suggested on one of the Ubuntu forums, the key here is to install the VirtualBox guest additions. Having done so on my system, I ran these commands:
cd /media/VBOXADDITIONS_3.2.6_63112/
sudo ./VBoxLinuxAdditions-x86.run
Rebooting my virtual machine and maximizing the VirtualBox window left me running Ubuntu at my native screen resolution of 1680×1050 :).
Update: On VirtualBox 4.1.2, use the virtual machine’s Devices -> Install Guest Additions … menu item. The ISO disc should be mounted automatically, and allowing autorun to continue should complete the installation. The VirtualBox website has more information on guest additions.