Assignment Task
In this task, we again work with the data and parser known from the previous assignment (alex165k.xml and MRDPUtils.java). Additionally, the file titles1m.xml provided in Moodle acts as a secondary data source. It contains the “publication year” and “title” of many publications, identified by their arXiv IDs and OpenAlex IDs as unique keys. titles1m.xml was created from the full OpenAlex dataset and contains only entries matching keys used in alex165k.xml.
- Write a MapReduce application that uses a reduce-side join to match information about publications based on their arXiv ID. Use paper_id == titles1m.arxiv_id as the join condition. We are especially interested in combining authors, discipline, year and title for each publication.
- Write a MapReduce application that computes all character-level n-grams (with 10 ≤ n ≤ 15) that appear at least 5,000 times in the input text, using the naive approach discussed in the lecture. Represent an n-gram in a reasonable way. To generate the n-grams, remove punctuation from each line, convert the text to lowercase, split the line into its characters, and build the n-grams from these characters.
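As a starting point for the n-gram task, the following plain-Java sketch shows one possible way to normalize a line and enumerate its character-level n-grams, followed by word-count-style counting with a frequency threshold. It does not use Hadoop and is not a complete solution. The class name CharNGramSketch and its methods are hypothetical, and the sketch assumes that an n-gram is represented as a String, that whitespace is kept as a character, and that "punctuation" means ASCII punctuation. Check these assumptions against the lecture.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; the names are illustrative only.
class CharNGramSketch {

    // Lowercase the line and drop ASCII punctuation (assumption: whitespace is kept).
    static String normalize(String line) {
        return line.toLowerCase(Locale.ROOT).replaceAll("\\p{Punct}", "");
    }

    // Enumerate every character-level n-gram of the line with minN <= n <= maxN.
    // In a MapReduce job, a mapper would emit each of these with a count of 1.
    static List<String> ngrams(String line, int minN, int maxN) {
        String chars = normalize(line);
        List<String> result = new ArrayList<>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= chars.length(); i++) {
                result.add(chars.substring(i, i + n));
            }
        }
        return result;
    }

    // In-memory stand-in for the reduce step: sum the counts per n-gram
    // and keep only n-grams that occur at least minCount times.
    static Map<String, Integer> frequent(List<String> lines, int minN, int maxN, int minCount) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String gram : ngrams(line, minN, maxN)) {
                counts.merge(gram, 1, Integer::sum);
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minCount) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("Hello, World!", 10, 15));
    }
}
```

For the line "Hello, World!" the normalized text is "hello world" (11 characters), so the sketch produces the three n-grams "hello worl", "ello world" and "hello world". In the actual application, the threshold passed as minCount would be 5,000.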
Hand in your code and the output of the reducer when run on the input file corpus.txt, which is available in Moodle (as a compressed file); this should yield 13 results.
These tasks will be discussed in the tutorial on July 8, 2024. Note that there will be no tutorial on July 1, 2024.
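For the reduce-side join in the first task, the following plain-Java sketch illustrates the general idea only: each record is tagged with its source in the map step, records are grouped by the join key, and the reduce step combines the two sides. It does not use Hadoop or MRDPUtils.java. The class name ReduceSideJoinSketch, the tags "A" and "B", the tab-separated value layout, and the inner-join behavior (unmatched keys produce no output) are all assumptions made for illustration, and the sample values in main are made up.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of a reduce-side join.
class ReduceSideJoinSketch {

    // Map step for the first source: key = paper_id, value tagged with "A".
    static Map.Entry<String, String> mapPaper(String paperId, String authors, String discipline) {
        return new AbstractMap.SimpleEntry<>(paperId, "A\t" + authors + "\t" + discipline);
    }

    // Map step for the second source: key = arxiv_id, value tagged with "B".
    static Map.Entry<String, String> mapTitle(String arxivId, String year, String title) {
        return new AbstractMap.SimpleEntry<>(arxivId, "B\t" + year + "\t" + title);
    }

    // Stand-in for the shuffle phase: group all tagged values by their key.
    static Map<String, List<String>> shuffle(List<Map.Entry<String, String>> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: separate the values by tag, then combine every "A" record
    // with every "B" record that shares the key (inner join).
    static List<String> reduce(String key, List<String> values) {
        List<String> papers = new ArrayList<>();
        List<String> titles = new ArrayList<>();
        for (String value : values) {
            if (value.startsWith("A\t")) {
                papers.add(value.substring(2));
            } else if (value.startsWith("B\t")) {
                titles.add(value.substring(2));
            }
        }
        List<String> joined = new ArrayList<>();
        for (String paper : papers) {
            for (String title : titles) {
                joined.add(key + "\t" + paper + "\t" + title);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        pairs.add(mapPaper("p1", "A. Author", "Physics"));   // made-up sample record
        pairs.add(mapTitle("p1", "2020", "Some Title"));     // made-up sample record
        for (Map.Entry<String, List<String>> group : shuffle(pairs).entrySet()) {
            System.out.println(reduce(group.getKey(), group.getValue()));
        }
    }
}
```

In a Hadoop job, the two map steps would typically be separate mapper classes, one per input file, and the grouping would be done by the framework rather than by a shuffle method.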
