As per this Ubuntu thread, the solution is:
- Window > Preferences > Java > Installed JREs.
- Double click on the JRE for which you want to attach source code.
- Under “JRE System Libraries”, select rt.jar.
- Click on the “Source Attachment…” button.
- Supply a path to the source zip file, e.g. C:/Program Files/Java/jdk1.6.0_23/src.zip
That zip file was already on my system (it ships with the JDK), but some projects were using the JRE rather than the JDK. Therefore, I needed to point those JREs at the JDK source zip file as well.
In order to tokenize all the files in the Delicious data set to facilitate searching, I came up with the following basic scheme:
- Decode all XML entities, leaving only the actual characters they represent (see html_entity_decode).
- Strip out all HTML tags (see strip_tags).
- Convert everything to lower case (see strtolower).
- Strip out all standalone numbers using the \b\d+\b regular expression (see preg_replace).
- Replace all remaining non-alphabetic characters with spaces.
- Collapse multiple spaces into one space.
This strategy translates very cleanly into PHP, as shown below:
class WordTokenizer
{
    public static function tokenizeFile($file)
    {
        $contents = file_get_contents($file);
        $contents = html_entity_decode($contents);
        $contents = strtolower(strip_tags($contents));
        // strip numbers
        $contents = preg_replace('/\b\d+\b/', ' ', $contents);
        // strip punctuation symbols & non-alphabetic symbols
        $contents = preg_replace('/[^a-z]/', ' ', $contents);
        // replace multiple spaces with just one
        $contents = preg_replace('/\s+/', ' ', $contents);
        return $contents;
    }
}
The full source code is available in the information retrieval repository.
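To see what the scheme does to a concrete input, here is a minimal sketch that applies the same steps to an in-memory string instead of a file. The function name tokenizeString and the sample input are assumptions made for illustration and are not part of the original class. It also goes one step further than tokenizeFile by splitting the cleaned text into an array of words.

```php
<?php
// Hypothetical helper (not part of the original WordTokenizer class):
// applies the same normalization steps to a string and returns the words.
function tokenizeString(string $text): array
{
    // decode entities, drop tags, lower-case
    $text = html_entity_decode($text);
    $text = strtolower(strip_tags($text));
    // strip standalone numbers
    $text = preg_replace('/\b\d+\b/', ' ', $text);
    // replace punctuation and other non-alphabetic characters with spaces
    $text = preg_replace('/[^a-z]/', ' ', $text);
    // split on runs of whitespace; PREG_SPLIT_NO_EMPTY drops the empty
    // pieces that leading or trailing spaces would otherwise produce
    return preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Made-up sample input, chosen to exercise every step.
$tokens = tokenizeString('<p>Hello &amp; welcome to 42 <b>Delicious</b> bookmarks!</p>');
print_r($tokens);
```

For the sample input this yields the five words hello, welcome, to, delicious and bookmarks. Note that \b\d+\b only matches digits that stand alone, so in a mixed token such as abc123 the digits are removed by the later non-alphabetic step rather than by the number step.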