Creating a Rudimentary Tokenizer for the Delicious Data Set

To tokenize all the files in the Delicious data set so that they can be searched, I came up with the following basic scheme:

  1. Decode all HTML entities, leaving only the actual characters they represent (see html_entity_decode).
  2. Strip out all HTML tags (see strip_tags).
  3. Convert everything to lower case (see strtolower).
  4. Replace all standalone numbers with a space using the \b\d+\b regular expression (see preg_replace).
  5. Replace every remaining non-alphabetic character with a space.
  6. Collapse each run of whitespace into a single space.

This strategy translates very cleanly into PHP as shown below:

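class WordTokenizer
{
	/**
	 * Reads a file and returns its text as lower-case alphabetic words
	 * separated by single spaces.
	 */
	public static function tokenizeFile($file)
	{
		$contents = file_get_contents($file);
		if ($contents === false) {
			throw new \RuntimeException("Unable to read file: $file");
		}
		
		// decode HTML entities into the characters they represent
		$contents = html_entity_decode($contents);
		
		// strip HTML tags, then lower-case what is left
		$contents = strtolower(strip_tags($contents));
		
		// strip standalone numbers
		$contents = preg_replace('/\b\d+\b/', ' ', $contents);
		
		// replace punctuation and all other non-alphabetic characters with spaces
		$contents = preg_replace('/[^a-z]/', ' ', $contents);
		
		// collapse runs of whitespace into a single space
		$contents = preg_replace('/\s+/', ' ', $contents);
		
		return $contents;
	}
}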
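
To see what the six steps do to a concrete input, here is a minimal sketch that applies them to a string instead of a file. The normalizeText helper and the sample string are made up for illustration and are not part of the repository. The final trim and the explode into an array are also my additions here, assuming you want a clean list of words rather than one space-separated string.

```php
<?php
// Hypothetical helper (not part of the original class): runs the same
// six steps on a string so the effect of each one is easy to inspect.
function normalizeText(string $text): string
{
    $text = html_entity_decode($text);              // step 1: decode entities
    $text = strtolower(strip_tags($text));          // steps 2 and 3: drop tags, lower-case
    $text = preg_replace('/\b\d+\b/', ' ', $text);  // step 4: standalone numbers
    $text = preg_replace('/[^a-z]/', ' ', $text);   // step 5: anything that is not a-z
    $text = preg_replace('/\s+/', ' ', $text);      // step 6: collapse whitespace

    // Assumption: trim the edges so splitting on ' ' yields no empty tokens.
    return trim($text);
}

// Made-up sample input containing a tag, an entity, a number and punctuation.
$sample = '<p>Tom &amp; Jerry: 42 <b>PHP</b> tips!</p>';

$normalized = normalizeText($sample);
$tokens = explode(' ', $normalized);

echo implode(', ', $tokens), PHP_EOL; // prints "tom, jerry, php, tips"
```

Note that step 5 only keeps the ASCII letters a to z, so accented and other non-English letters are dropped along with the punctuation.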

The full source code is available in the information retrieval repository.
