- The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases, which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing, which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.
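For anyone wondering what "word embeddings without supervision" looks like in practice: the paper uses a word2vec-style skip-gram model trained on abstracts. Here's a minimal sketch using gensim; the corpus file, hyperparameters, and example queries are my own illustrative placeholders, not the authors' actual setup or results.

```python
# Minimal sketch of the kind of pipeline the paper describes, using
# gensim's Word2Vec. File name, hyperparameters, and queries below are
# hypothetical illustrations, not the authors' actual data.
from gensim.models import Word2Vec

# Assume abstracts.txt holds one pre-tokenized abstract per line, with
# material formulas kept as single tokens (e.g. "LiFePO4").
with open("abstracts.txt") as f:
    sentences = [line.split() for line in f]

# Skip-gram embeddings (sg=1); settings are illustrative only.
model = Word2Vec(sentences, vector_size=200, window=8, min_count=5, sg=1)

# Analogy arithmetic: vec("NiFe") - vec("ferromagnetic")
# + vec("antiferromagnetic") should land near an antiferromagnetic alloy.
print(model.wv.most_similar(positive=["NiFe", "antiferromagnetic"],
                            negative=["ferromagnetic"], topn=5))

# Ranking materials by similarity to an application keyword is the idea
# behind the paper's "recommending materials before their discovery".
print(model.wv.most_similar("thermoelectric", topn=10))
```

The interesting part is that nothing chemistry-specific goes in: the structure (periodic table neighborhoods, property associations) falls out of plain co-occurrence statistics.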
edited to add: this story by Ted Chiang gives an interesting picture of where this could go.
The end is near O.O I was wondering how long it would take until something like this showed up.

A common discussion topic in our group is the future of scientific publishing. My boss, who has profited a lot from the current system, believes that things should continue this way, with a "story-driven format": you do experiment after experiment in your lab until you have a "story" to tell, and then you package all that data into a nice narrative and publish it as such.

I think that this approach will die out. It is inefficient, and a lot of stuff is cherry-picked to fit your current beliefs. What happens is that you get only part of the data, part of the "actual story". Things that do not fit your argument, or even contradict it, are swept under the rug or simply ignored.

A better approach, in my opinion, would be a "by experiment" approach: you plan an experiment based on previous data, declare that you are doing it, do the experiment, and publish the results. Rinse and repeat. At the end, you can write a story if you like. It will serve scientific communication, but the experts and colleagues in your field don't need it; they know what's happening anyway.

His main argument against such an approach is that we would flood the scientific world with incoherent tidbits of information. We need to put things into a story for people to remember and process them. True, people need that. But not computers with powerful algorithms. I am kinda stoked and scared about this.
The other problem IMO with the story-based approach is that when we're writing a story, we feel the need to make it exciting. One of the big problems right now is that no one is willing to do the work of confirming results, because everyone wants to find something "new."