Here are some recommendations for linking the place names listed in a gazetteer with the place names mentioned in a corpus file:

  1. Use string matching algorithms: Compare the names in the gazetteer with those in the corpus file using a similarity measure such as Levenshtein distance or Jaro-Winkler distance. This helps identify candidate matches even when the spellings are not identical, for example because of typos or variant spellings.

  2. Use contextual features: Nearby words or phrases can help pick the correct entry when several candidates are plausible. For example, if the corpus file mentions "New York City" and the gazetteer has entries for both "New York City" and "York City", the words surrounding the mention in the corpus file can indicate which gazetteer entry is the correct match.

  3. Use a named entity recognition (NER) system: An NER system can automatically identify and extract named entities, including locations, from the corpus file. Matching only those extracted location mentions against the gazetteer narrows down which gazetteer entries correspond to entities in the corpus file.

  4. Normalize the names: Convert the names in both the gazetteer and the corpus file to a standard format (e.g., lowercase, with diacritics removed) so that superficial differences do not prevent a match. This reduces the number of false negatives.

  5. Combine multiple approaches: Combining approaches can improve the accuracy of the linking process. For example, you might use string matching algorithms to generate candidate matches, and then use contextual features to choose the correct one.
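
To make points 1 and 4 concrete, here is a minimal Python sketch that normalizes names and then ranks gazetteer entries by a Levenshtein-based similarity. It uses only the standard library. The function names (`normalize`, `levenshtein`, `similarity`, `link_mention`), the sample gazetteer, and the 0.8 threshold are all illustrative assumptions, not part of any particular toolkit, and the threshold would need tuning on your own data.

```python
import unicodedata


def normalize(name: str) -> str:
    """Lowercase, strip diacritics, and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_marks.lower().split())


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic row-by-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale edit distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def link_mention(mention, gazetteer, threshold=0.8):
    """Return (entry, score) pairs scoring at or above threshold, best first.

    The threshold of 0.8 is an assumed starting point, not a recommended value.
    """
    target = normalize(mention)
    scored = [(entry, similarity(target, normalize(entry))) for entry in gazetteer]
    matches = [(entry, score) for entry, score in scored if score >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical gazetteer entries, for illustration only.
    sample_gazetteer = ["São Paulo", "San Pablo", "Santiago"]
    for entry, score in link_mention("Sao Paolo", sample_gazetteer):
        print(f"{entry}\t{score:.3f}")
```

Comparing every mention against every gazetteer entry is quadratic, so for a large gazetteer you would probably want to index the normalized names first (exact lookup, then fuzzy matching only on the misses). Context-based disambiguation (point 2) is left out of this sketch.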