The arXiv, an electronic repository of scientific preprints, receives about 7,000 submissions per month on average. That’s a lot of papers and text to tackle if you’re Cornell graduate student Alexander Alemi.
Alemi is analyzing the text in months’ worth of arXiv articles with help from three other Cornell scientists, including arXiv developer Paul Ginsparg. In the past, Ginsparg has toyed with ideas for a different arXiv that never came to fruition. After a user clicked on an article, for example, the site would suggest related articles and authors. In a way, Alemi hopes to resurrect this idea and eventually bring it to arXiv users.
“Eventually it will help authors find other authors they want to read and help readers find papers they think are interesting,” Alemi said.
At the same time, Alemi is dissecting the data and recording which words appear together most often. For example, when Alemi punches the word “purple” into the model he is using, he learns that “orange” is purple’s most likely partner, followed closely by “yellow” and then “magenta.” Understanding whether these pairings are trivial or nontrivial, and why, opens a new archive of questions.
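To give a rough sense of what "words that appear together" can mean in practice, here is a minimal Python sketch that counts which words fall near a target word in a handful of sentences. It is only an illustration: the article does not say which model Alemi uses, and the function name `cooccurrence_partners`, the `window` parameter, and the toy sentences are all invented for this example rather than taken from arXiv data.

```python
from collections import Counter


def cooccurrence_partners(sentences, target, window=2):
    """Count words appearing within `window` tokens of `target`.

    Hypothetical helper for illustration; not the model described
    in the article.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            # Look at neighbors on both sides, staying inside the sentence.
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


# Made-up toy corpus, chosen only to demonstrate the counting.
toy_corpus = [
    "purple orange yellow",
    "orange purple magenta",
    "purple orange",
    "yellow purple",
]

partners = cooccurrence_partners(toy_corpus, "purple")
print(partners.most_common(3))
```

On this toy input the ranking happens to come out as orange, yellow, magenta, but that says nothing about real arXiv text. A model trained on months of articles would draw on far more context than raw neighbor counts.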