MapReduce with n-gram Analysis Practical Assignment

Assignment Task

In this task, we again work with the data and parser known from the previous assignment (alex165k.xml and MRDPUtils.java). Additionally, the file titles1m.xml provided in Moodle will act as a secondary data source. It contains the “publication year” and “title” of many publications, identified by their arXiv-IDs and OpenAlex-IDs as unique keys. titles1m.xml was created from the full OpenAlex dataset and contains only entries matching keys used in alex165k.xml.

  • Write a MapReduce application that uses a reduce-side join in order to match information about publications based on their arXiv-ID. Use paper_id == titles1m.arxiv_id for the join. We are especially interested in combining authors, discipline, year and title for each publication.

  • Write a MapReduce application that computes all character-level n-grams (with 10 ≤ n ≤ 15) that appear at least 5,000 times in the input text, using the naive approach discussed in the lecture. Represent an n-gram in a reasonable way. Split up a line into its characters, after removing punctuation and converting the text to lowercase, and generate n-grams from these characters.
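To illustrate the idea behind the reduce-side join, the following is a minimal sketch in plain Java that simulates the map, shuffle and reduce phases in memory, without the Hadoop API. It is not a reference solution. The class name ReduceSideJoinSketch, the tags "A" and "T", the tab-separated value layout, the choice of an inner join and the sample records in main are all assumptions made for illustration; in a real job the fields would come from alex165k.xml and titles1m.xml via the provided parser.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory simulation of a reduce-side join (no Hadoop dependency).
class ReduceSideJoinSketch {

    // Map phase for alex165k records: key = paper_id, value tagged with "A".
    static Map.Entry<String, String> mapAlex(String paperId, String authors, String discipline) {
        return new SimpleEntry<>(paperId, "A\t" + authors + "\t" + discipline);
    }

    // Map phase for titles1m records: key = arxiv_id, value tagged with "T".
    static Map.Entry<String, String> mapTitle(String arxivId, String year, String title) {
        return new SimpleEntry<>(arxivId, "T\t" + year + "\t" + title);
    }

    // Shuffle phase: group all values by key, sorted by key.
    static Map<String, List<String>> shuffle(List<Map.Entry<String, String>> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: split the values by tag and emit one line per matching pair.
    // Keys that occur in only one source produce no output (inner join, an assumption).
    static List<String> reduce(String key, List<String> values) {
        List<String> alex = new ArrayList<>();
        List<String> titles = new ArrayList<>();
        for (String v : values) {
            if (v.startsWith("A\t")) {
                alex.add(v.substring(2));
            } else if (v.startsWith("T\t")) {
                titles.add(v.substring(2));
            }
        }
        List<String> out = new ArrayList<>();
        for (String a : alex) {
            for (String t : titles) {
                out.add(key + "\t" + a + "\t" + t); // id, authors, discipline, year, title
            }
        }
        return out;
    }

    // Runs shuffle and reduce over all tagged pairs.
    static List<String> join(List<Map.Entry<String, String>> pairs) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : shuffle(pairs).entrySet()) {
            result.addAll(reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample records, not taken from the assignment data.
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        pairs.add(mapAlex("2101.00001", "A. Author; B. Author", "Physics"));
        pairs.add(mapTitle("2101.00001", "2021", "A Made-Up Title"));
        pairs.add(mapTitle("2101.00002", "2020", "No Matching Record"));
        join(pairs).forEach(System.out::println);
    }
}
```

Tagging each value with its source is what lets the reducer tell the two inputs apart once they arrive under the same key. In an actual Hadoop job, the two map methods would typically become two Mapper classes, one per input file.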

Hand in your code and the output of the reducer when run on the input file corpus.txt, which is available in Moodle (as a compressed file); this should yield 13 results. These tasks will be discussed in the tutorial on July 8, 2024. Note that there will be no tutorial on July 1, 2024.
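For the n-gram task, the counting logic can be sketched in plain Java as follows. This is a rough illustration rather than the expected solution: it assumes that the naive approach means emitting every n-gram once per occurrence and summing the counts afterwards, that punctuation means ASCII punctuation (Java's \p{Punct}), and that whitespace is kept as a character. The class name NGramSketch and its method names are hypothetical, and different normalization choices may change the number of results on corpus.txt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch of naive character-level n-gram counting.
class NGramSketch {

    // Lowercase the line and strip ASCII punctuation; whitespace is kept (assumption).
    static String normalize(String line) {
        return line.toLowerCase(Locale.ROOT).replaceAll("\\p{Punct}", "");
    }

    // "Map" step: all character n-grams of one line with minN <= n <= maxN.
    static List<String> ngrams(String line, int minN, int maxN) {
        String s = normalize(line);
        List<String> out = new ArrayList<>();
        for (int start = 0; start < s.length(); start++) {
            for (int n = minN; n <= maxN && start + n <= s.length(); n++) {
                out.add(s.substring(start, start + n));
            }
        }
        return out;
    }

    // "Reduce" step: sum the occurrences and keep n-grams seen at least `threshold` times.
    static Map<String, Integer> frequent(List<String> lines, int minN, int maxN, int threshold) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String gram : ngrams(line, minN, maxN)) {
                counts.merge(gram, 1, Integer::sum);
            }
        }
        counts.values().removeIf(c -> c < threshold);
        return counts;
    }

    public static void main(String[] args) {
        // Tiny made-up input with small parameters; the assignment uses
        // minN = 10, maxN = 15 and threshold = 5000 on corpus.txt.
        List<String> lines = Arrays.asList("To be, or not to be!", "To be or not.");
        frequent(lines, 3, 4, 2).forEach((gram, count) ->
                System.out.println("[" + gram + "]\t" + count));
    }
}
```

Representing an n-gram as a plain String keeps it directly usable as a key; in a Hadoop job this would correspond to a Text key with an integer count as the value. Since n-grams are generated per line, this sketch never spans line boundaries.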