We live in the digital age, an era in which knowledge and information are power and are considered key to our overall progress. Every day, hundreds of documents appear in digital form: newspaper and magazine articles, online books, scientific discoveries and publications. Without doubt, the amount of information at our fingertips is enormous. In biology, medicine and their subfields, thousands of articles are published online every month, covering reviews and analyses, discoveries and new experiments. Handling this flood of data is a pressing need for these fields.
The growing volume of literature has become a significant problem for scientists and researchers. It is becoming impossible to follow even for the most experienced readers, leading to wasted money and research time.
The question that arises is the following: “Are we able to handle this amount of data? And if so, how will we find what we are looking for in time to make the right decisions?”
Bioinformatics and mining biological literature
In bioinformatics, when we want to find and extract knowledge from textual data, we focus on specific relations between entities. We mostly target interactions between proteins, genes, drugs, diseases and so on.
Now, take a step back and think of a significant biological field…
Let’s talk about genomics and briefly explain what it is. A genome is the complete set of DNA present in a cell. Using DNA sequencing techniques and bioinformatics, we can analyze the structure and function of a genome. We can study how genes interact with each other and with the environment in which they reside. The scientific field that focuses on how genomes work is called genomics. Genomics experts attempt to unravel the mysteries of the DNA sequence in order to answer complex questions. For example, genomics examines the genomes involved in serious diseases such as cancer, diabetes, heart disease and many more.
As you can imagine, genomics is just one among dozens of biological fields that are constantly creating huge amounts of new data. And yes, keeping up with this kind of information is not easy.
A new field: Text mining
As we mentioned before, the numerous subfields of biology, such as genomics, give us tons of information in the form of numbers, sequences and genomes. But we also get something else: the large amount of plain text that accompanies scientific publications. This is the essential “literature”, where scientists describe their thoughts, explain their methodology and analyze their conclusions.
This textual data is a great tool for those who can handle it and use it accordingly. At the same time, it has given rise to a new field of research called Text Mining. TM, as we are going to call it from now on, focuses on discovering and extracting previously unknown knowledge from literature texts by combining sophisticated methods from machine learning, computational linguistics and information retrieval.
By using these techniques we can save significant time in information extraction, which in turn leads to more promising hypothesis generation. In human genomics, this automated gene and protein detection seems very promising, as a significant number of new reports establish new variables about rare diseases. Being able to study, evaluate and connect these new variables to existing information is crucial. We have to mention, though, that for copyright reasons very few articles are free to read online, so TM focuses on titles and abstracts, which are freely accessible in databases such as BMC and MEDLINE.
How does TM work?
TM, as we mentioned earlier, is about discovering knowledge in unstructured text. Most of the time we deal with three major objectives: identifying the essential data, extracting information, and detecting associations between the extracted data. We can imagine TM as a curator who searches all the available resources, such as online publications, patents and journals, finds the available texts, links them together and categorizes them.
To begin with, before we can extract biological entities from a text, we must be able to identify them. Biological entities can be proteins, cells, genes, genomes, diseases, chemical compounds and many other biological concepts. Next, we perform Named Entity Recognition (NER) and Term Normalization, that is, we distinguish, store and sort our findings into categories and associate them with the right entries in our database.
The next step is to check the relations between the stored entities and to determine the kind and type of each relation.
Named entity recognition (NER)
When considering NER, we should first focus on two problems. The first is the ever-evolving literature of biology: there are millions of terms referring to genes, proteins, patterns, compounds and so on, and more are being created as we write this very text. The second is the similarity of acronyms and abbreviations in biological terminology, together with the variety of names a single entity can have.
For example, the names P53, TP53 and TRP53 all relate to the same gene. Likewise, when our imaginary curator comes across the word “Parkinson”, it has to decide whether the word refers to James Parkinson, who was the first to study Parkinson’s disease, or to the disease itself.
To address these problems, a committee was created. The HUGO Gene Nomenclature Committee (HGNC) aims to assign a unique name and symbol to every known gene, and so far it has done a great job, assigning names and symbols to over 35,000 entities. This number is large, but many entities still remain unassigned. Other notable mentions are BioCreative and NLPBA.
There are three main methods used for NER (hybrid methods can also be used):
- Dictionary-based
- Rule-based
- Machine learning
Dictionary-based methods use simple text-matching algorithms with a preset dictionary. We search the text and then match our findings against the entries of our dictionary. These techniques depend heavily on the preset dictionaries and the matching algorithms used, which is why they produce a large number of ambiguous results.
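The dictionary lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not a production NER system: the dictionary contents and the function name `dictionary_ner` are assumptions made for the example.

```python
import re

# Hypothetical mini-dictionary (an assumption for this sketch):
# surface form -> entity type.
ENTITY_DICTIONARY = {
    "TP53": "gene",
    "GIPC1": "gene",
    "ITGA5": "gene",
    "keratin": "protein",
    "myosin": "protein",
}

def dictionary_ner(text, dictionary=ENTITY_DICTIONARY):
    """Return (matched text, entity type, start offset) for each dictionary hit."""
    lookup = {term.lower(): label for term, label in dictionary.items()}
    # Try longer terms first so a long name wins over a shorter one it contains.
    terms = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    return [
        (match.group(0), lookup[match.group(0).lower()], match.start())
        for match in pattern.finditer(text)
    ]

hits = dictionary_ner("GIPC1 binds ITGA5, while p53 is not in the dictionary.")
print(hits)
```

Note that "p53" is missed because only the spelling "TP53" is listed, which illustrates how strongly the result depends on the dictionary.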
Rule-based methods focus on recognizing entities based on symbols, numbers and affixes. For example, many biological entities end with specific suffixes such as -in, for instance keratin (a family of fibrous structural proteins) or myosin (motor proteins known for their role in muscle contraction). These methods therefore create rules that group words sharing specific orthographic features. Rule-based methods are considered very accurate, since a single simple rule can classify a large number of entities. On the other hand, because of the varied grammatical and syntactic rules of our language, they are not very flexible.
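A suffix rule like the one for -in can be written as a regular expression. The sketch below is an assumption-laden toy: the rule (any word with at least three letters before a final "in") is invented for illustration and is far cruder than the rules a real system would use.

```python
import re

# Illustrative rule (an assumption): a word with at least three letters
# followed by the suffix "in" is treated as a protein candidate.
PROTEIN_SUFFIX_RULE = re.compile(r"\b[A-Za-z]{3,}in\b")

def suffix_candidates(text):
    """Return every word in the text that satisfies the suffix rule."""
    return PROTEIN_SUFFIX_RULE.findall(text)

candidates = suffix_candidates("Keratin and myosin are found within muscle and skin.")
print(candidates)
```

The rule finds "Keratin" and "myosin" but also flags the ordinary word "within", which shows why a single orthographic rule struggles with the variety of natural language.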
Machine learning methods are considered to produce the best results for NER. They use large annotated data sets to identify and classify entities in text. There are two major machine learning techniques: classification and sequence labeling. Today, these methods are used increasingly in comparison with rule-based and dictionary-based methods.
Term normalization
After finding and flagging our results with NER methods, we must link them to the appropriate entries in our databases.
Term normalization compares entities and assigns the matching identifier. Here we have to mention again the difficulty of associating and matching entities based on the biological literature. Genomic nomenclature is rich and at the same time ambiguous (a gene or protein name can map to more than one identifier). Today, one of the most widely used knowledge bases is the Gene Ontology (GO), which aims to develop a computational model describing the properties and functions of genes.
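As a rough sketch of term normalization, the snippet below maps recognized mentions to identifiers through a synonym table. Everything in the table is hypothetical: the identifiers such as `GENE:0001` are placeholders rather than real database accessions, and `ALIAS1` is an invented name used only to show an ambiguous mention.

```python
# Hypothetical synonym table (placeholder identifiers, not real accessions).
SYNONYMS = {
    "GENE:0001": {"P53", "TP53", "TRP53"},
    "GENE:0002": {"GIPC1"},
    "GENE:0003": {"ALIAS1"},  # invented alias shared by two entries
    "GENE:0004": {"ALIAS1"},
}

def build_index(synonyms):
    """Invert the table: upper-cased name -> set of identifiers."""
    index = {}
    for identifier, names in synonyms.items():
        for name in names:
            index.setdefault(name.upper(), set()).add(identifier)
    return index

def normalize(mention, index):
    """Return all identifiers for a mention; more than one signals ambiguity."""
    return sorted(index.get(mention.upper(), set()))

index = build_index(SYNONYMS)
print(normalize("p53", index))
print(normalize("Trp53", index))
print(normalize("ALIAS1", index))
```

Returning a list instead of a single identifier keeps the ambiguity visible, so a later step can decide between the candidates using context.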
By this point we should be able to find and flag our entities. The next step is to extract them along with their relations to other significant terms in our text. One common way of doing so is to look at how often entities recur together in a given text. This task can be very challenging because a relationship can be described in many different ways. For example, “..GIPC1 binds ITGA5..” or “..binding of GIPC1 and ITGA5..”. Additionally, a relation between two entities can be analyzed and described across several sentences, or a scientist may only speculate about a relation between two entities.
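The two phrasings of the GIPC1 and ITGA5 example can be captured with a pair of patterns. This is only a sketch under simplifying assumptions: the `ENTITY` expression is a crude stand-in for real NER output, and the two patterns are invented for this example rather than taken from any existing tool.

```python
import re

# Crude stand-in for NER output (an assumption): an upper-case letter
# followed by two or more upper-case letters or digits, e.g. GIPC1.
ENTITY = r"[A-Z][A-Z0-9]{2,}"

# Two illustrative patterns for the same "binds" relation.
BINDING_PATTERNS = [
    re.compile(rf"\b({ENTITY}) binds (?:to )?({ENTITY})\b"),
    re.compile(rf"\bbinding of ({ENTITY}) (?:and|to|with) ({ENTITY})\b"),
]

def extract_binding(text):
    """Return (entity, 'binds', entity) triples found by any pattern."""
    relations = []
    for pattern in BINDING_PATTERNS:
        for match in pattern.finditer(text):
            relations.append((match.group(1), "binds", match.group(2)))
    return relations

print(extract_binding("We show that GIPC1 binds ITGA5 in vitro."))
print(extract_binding("The binding of GIPC1 and ITGA5 was observed."))
```

Both sentences yield the same triple, but every new phrasing needs a new pattern, and relations spread over several sentences or stated as speculation are not handled at all.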
There are several approaches to these problems. We use machine learning or rule-based methods with the help of linguistics. Rule-based methods in particular use a set of rules that describe the interactions between entities. One of the most widely used rule-based methods is RLIMS-P, which uses sets of rules and patterns to extract information about protein phosphorylation sites.
Hypothesis Generation
Last but not least, we have hypothesis generation. Studies and research introduce new information to the world, and by studying this knowledge we can arrive at new hypotheses.
A great example of hypothesis generation was provided by Swanson in 1986. Swanson discovered the connection between Raynaud’s disease and fish oil (fish oil can be used as a treatment for Raynaud’s disease), which was clinically confirmed two years later. Automated hypothesis generation is not an easy task and can be very challenging. Using computers to extract knowledge from published papers and to connect it logically in order to reach a conclusion is probably the most intriguing aspect of Text Mining.
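One simple way to picture this kind of discovery is to look for two terms that never appear together but share intermediate terms. The sketch below uses hand-written toy links: the intermediate terms are illustrative assumptions, not data extracted from real papers, and the function name `candidate_hypotheses` is invented for this example.

```python
from collections import defaultdict

# Toy "co-mentioned in the literature" pairs, written by hand for illustration.
LITERATURE_LINKS = [
    ("fish oil", "blood viscosity"),
    ("fish oil", "platelet aggregation"),
    ("fish oil", "triglycerides"),
    ("blood viscosity", "Raynaud's disease"),
    ("platelet aggregation", "Raynaud's disease"),
]

def candidate_hypotheses(links, start):
    """Find terms linked to `start` only indirectly, with the shared intermediates."""
    neighbours = defaultdict(set)
    for first, second in links:
        neighbours[first].add(second)
        neighbours[second].add(first)
    direct = neighbours[start]
    hypotheses = defaultdict(set)
    for intermediate in direct:
        for target in neighbours[intermediate]:
            # Keep only targets that are not already directly linked to start.
            if target != start and target not in direct:
                hypotheses[target].add(intermediate)
    return {target: sorted(via) for target, via in hypotheses.items()}

result = candidate_hypotheses(LITERATURE_LINKS, "fish oil")
print(result)
```

In this toy data, fish oil and Raynaud's disease are never linked directly, yet they share two intermediate terms, so the pair is proposed as a candidate hypothesis for a human expert to evaluate.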
Text mining still has many steps to cover in order to make biological knowledge accessible. Adopting and enhancing TM methodologies is a challenge for both bioinformaticians and text-mining experts. As long as biologists write texts, bioinformatics will have to use automated techniques to extract data from them. TM, although new and challenging, is a growing field of study. Creating tools for knowledge extraction, identification of relations and hypothesis generation is crucial for better understanding the mysteries of genomics and of biology in general. A great deal of knowledge remains undiscovered, and TM together with machine learning could be a great tool for biologists.