msir2016.github.io - MSIR Shared Task


A large number of languages, including Arabic, Russian, and most of the South and South East Asian languages, are written using indigenous scripts. However, websites and user-generated content (such as tweets and blogs) in these languages are often written in Roman script for various socio-cultural and technological reasons. This process of phonetically representing the words of a language in a non-native script is called transliteration. Transliteration, especially into Roman script, is use

Two pilot subtasks on transliterated search were introduced as part of FIRE 2013. Subtask 1 covered language identification of the query words, followed by transliteration of the Indian-language words. It was conducted for three Indian languages: Hindi, Bangla and Gujarati. Subtask 2 covered ad hoc retrieval of Bollywood song lyrics, one of the most common forms of transliterated search that commercial search engines have to tackle. Five teams participated in the shared task.

In FIRE 2014, the scope of Subtask 1 was extended to cover three more South Indian languages: Tamil, Kannada and Malayalam. In Subtask 2, we introduced (a) queries in Devanagari script, and (b) more natural queries with splitting and joining of words. More than 15 teams participated in the tasks.