Automated Bibliography

Using the seed papers as input, we trained a custom recommender to identify similar papers that may be important in the field. The top papers ranked by relevance are included here in the Automated Bibliography Page. Additional details about the data and methods are provided below. You can download the data directly using the button above.

Below are the papers in the science communication seed set, together with the additional papers that this method identified as related to the seed set and that domain experts at NAS marked as relevant. To view the papers related to both science communication and misinformation, click here.


We use the seed papers to train an algorithm that automatically finds other relevant papers in science communication and misinformation research. The ranking task is set up as a supervised machine learning problem. A portion of the seed papers is set aside and used to generate features, and the remainder are combined with a large set of candidate papers to create the total candidate set. This candidate set is split into train and test sets, and the process proceeds as a binary classification task in which the positive class consists of the papers that were originally among the seed papers.

The features include structural features of the citation network and a textual feature. The structural features are a measure of how close a paper tends to be to the set-aside seed papers in the citation network (using the Infomap clustering algorithm) and the PageRank value of the paper (a citation-based measure of importance). The textual feature is the average cosine similarity of the paper’s title (tf-idf weighted) to the titles of the set-aside seed papers.

A random forest classifier is trained on these features, and the test papers are ranked by the classifier's predicted probability of belonging to the positive class. The overall process is repeated several times, and the probabilities are summed across repetitions. The Automated Bibliography presents the top-ranked papers by relevance, after excluding any papers that were originally in the seed set.
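As an illustration of the textual feature only, the following is a minimal sketch in Python using the standard library. It is not the project's actual code: the function names, the tokenizer, the smoothed idf formula, and the example titles are all assumptions made for this sketch, and the published method may differ in each of these details.

```python
import math
from collections import Counter


def tokenize(title):
    """Lowercase a title and split it into alphanumeric word tokens."""
    return "".join(c if c.isalnum() else " " for c in title.lower()).split()


def tfidf_vectors(titles):
    """Return one sparse tf-idf vector (dict of term -> weight) per title."""
    docs = [Counter(tokenize(t)) for t in titles]
    n_docs = len(docs)
    # Document frequency: number of titles in which each term appears.
    doc_freq = Counter(term for doc in docs for term in doc)
    # Smoothed idf; this variant is an assumption, not taken from the source.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def avg_title_similarity(candidate_titles, seed_titles):
    """Average cosine similarity of each candidate title to the seed titles."""
    if not seed_titles:
        raise ValueError("at least one set-aside seed title is required")
    # Fit tf-idf weights on seeds and candidates together.
    vectors = tfidf_vectors(list(seed_titles) + list(candidate_titles))
    seed_vecs = vectors[:len(seed_titles)]
    cand_vecs = vectors[len(seed_titles):]
    return [sum(cosine(c, s) for s in seed_vecs) / len(seed_vecs)
            for c in cand_vecs]


# Hypothetical titles, invented purely for illustration.
seeds = [
    "Communicating science to the public",
    "Misinformation and science communication online",
]
candidates = [
    "Public trust in science communication",
    "Graph algorithms for protein folding",
]
scores = avg_title_similarity(candidates, seeds)
```

In this sketch the first candidate shares terms with the seed titles and receives a positive score, while the second shares none and scores zero. In the full method, this score would be one column of the feature matrix alongside the two citation-network features.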

The method is described in full detail in this publication, and the code is available here.


Please contact us if you have any questions or comments about the tools, data or other content.