Saint's Log

11 May 2011

Attaching Java Source Files to an Eclipse JRE

As per an Ubuntu thread, the solution is:

  1. Go to Window > Preferences > Java > Installed JREs.
  2. Double-click the JRE to which you want to attach source code.
  3. Under "JRE System Libraries", select rt.jar.
  4. Click on the "Source Attachment..." button.
  5. Supply a path to the source zip file, e.g. C:/Program Files/Java/jdk1.6.0_23/src.zip

That zip file was already on my system (it was installed with the JDK), but some projects were configured to use a JRE rather than the JDK, and a JRE does not come with source code. For those JREs, I had to point the source attachment at the JDK's src.zip as well.

Filed under: Eclipse

6 May 2011

Creating a Rudimentary Tokenizer for the Delicious Data Set

To make the files in the Delicious data set searchable, I tokenize each one using the following basic scheme:

  1. Decode all HTML entities into the characters they represent (see html_entity_decode).
  2. Strip out all HTML tags (see strip_tags).
  3. Convert everything to lower case (see strtolower).
  4. Strip out all numbers using the \b\d+\b regular expression (see preg_replace).
  5. Replace any remaining non-alphabetic characters with spaces (see preg_replace).
  6. Collapse multiple spaces into one space.

This strategy translates very cleanly into PHP as shown below:

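class WordTokenizer
{
	/**
	 * Reads a file and returns its contents as a single string of
	 * lower-case alphabetic words separated by single spaces.
	 */
	public static function tokenizeFile($file)
	{
		$contents = file_get_contents($file);
		if ($contents === false)
		{
			throw new RuntimeException("Unable to read file: $file");
		}
		
		// decode entities, e.g. &amp; becomes &
		$contents = html_entity_decode($contents);
		
		// strip HTML tags, then convert to lower case
		$contents = strtolower(strip_tags($contents));
		
		// strip standalone numbers
		$contents = preg_replace('/\b\d+\b/', ' ', $contents);
		
		// strip punctuation symbols & non-alphabetic symbols
		$contents = preg_replace('/[^a-z]/', ' ', $contents);
		
		// replace multiple spaces with just one
		$contents = preg_replace('/\s+/', ' ', $contents);
		
		return $contents;
	}
}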

The full source code is available in the information retrieval repository.
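
The function above returns one space-separated string rather than a list of words. As a rough sketch of how that string could be turned into tokens, the snippet below applies the same six steps to an in-memory string and splits the result on spaces. Note that tokenizeString is a hypothetical helper written for this illustration; it is not part of the original class or the repository.

```php
<?php
// Hypothetical helper (not from the original source): runs the same six
// steps on a string and returns an array of tokens instead of one string.
function tokenizeString($text)
{
	$text = html_entity_decode($text);
	$text = strtolower(strip_tags($text));

	// strip standalone numbers
	$text = preg_replace('/\b\d+\b/', ' ', $text);

	// replace everything that is not a-z with a space
	$text = preg_replace('/[^a-z]/', ' ', $text);

	// collapse runs of whitespace
	$text = preg_replace('/\s+/', ' ', $text);

	// PREG_SPLIT_NO_EMPTY drops the empty strings that leading or
	// trailing spaces would otherwise produce
	return preg_split('/ /', $text, -1, PREG_SPLIT_NO_EMPTY);
}

$tokens = tokenizeString('Tom &amp; Jerry: <b>42</b> PHP tips, 2nd edition');
echo implode(',', $tokens), "\n"; // prints: tom,jerry,php,tips,nd,edition
```

The sample input also shows a side effect of the scheme: "2nd" becomes "nd", because step 4 only removes standalone digit runs and step 5 then blanks out the remaining digit.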
