Creating a Rudimentary Tokenizer for the Delicious Data Set
To tokenize every file in the Delicious data set so that it can be searched, I came up with the following basic scheme:
- Decode all HTML entities, leaving only the characters they represent (see html_entity_decode).
- Strip out all HTML tags (see strip_tags).
- Convert everything to lower case (see strtolower).
- Strip out all numbers using the \b\d+\b regular expression (see preg_replace).
- Replace all remaining non-alphabetic characters with spaces.
- Collapse multiple spaces into one space.
This strategy translates very cleanly into PHP as shown below:
    class WordTokenizer {
        public static function tokenizeFile($file) {
            $contents = file_get_contents($file);
            $contents = html_entity_decode($contents);
            $contents = strtolower(strip_tags($contents));
            // strip numbers
            $contents = preg_replace('/\b\d+\b/', ' ', $contents);
            // strip punctuation symbols & non-alphabetic symbols
            $contents = preg_replace('/[^a-z]/', ' ', $contents);
            // replace multiple spaces with just one
            $contents = preg_replace('/\s+/', ' ', $contents);
            return $contents;
        }
    }
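To see what the pipeline does to a concrete input, here is a minimal sketch that applies the same steps to a string already in memory and then splits the result into individual tokens. The function name tokenizeString, the sample string, and the final trim-and-split step are assumptions added for illustration; they are not part of the original class.

```php
<?php
// Hypothetical helper (not in the original class): runs the same steps as
// WordTokenizer::tokenizeFile on a string instead of a file.
function tokenizeString(string $contents): array
{
    // Decode entities, strip tags, and lower-case the text.
    $contents = html_entity_decode($contents);
    $contents = strtolower(strip_tags($contents));
    // Replace standalone numbers with a space.
    $contents = preg_replace('/\b\d+\b/', ' ', $contents);
    // Replace every remaining non-alphabetic character with a space.
    $contents = preg_replace('/[^a-z]/', ' ', $contents);
    // Collapse runs of whitespace into a single space.
    $contents = preg_replace('/\s+/', ' ', $contents);
    // Added for illustration: trim the edges and split into tokens,
    // dropping any empty pieces.
    return preg_split('/ /', trim($contents), -1, PREG_SPLIT_NO_EMPTY);
}

// Sample input invented for this example.
$sample = '<p>Tom &amp; Jerry: 42 <b>Episodes</b> in 1940!</p>';
print_r(tokenizeString($sample));
// tokens: tom, jerry, episodes, in
```

Note that because the non-alphabetic pass also removes digits, the separate number-stripping step only matters if that later pass is ever relaxed, for example to keep alphanumeric tokens.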
The full source code is available in the information retrieval repository.