About me

I’m Balu, from Alappuzha (which Lord Curzon called the “Venice of the East”), a city on the Laccadive Sea best known for houseboat cruises along its rustic backwaters, a network of tranquil canals and lagoons in the southern Indian state of Kerala (“God’s Own Country”).

I am currently a Post Doctoral Fellow at the eHealth Lab in the School of Information (iSchool), College of Communication & Information, at Florida State University, working under Prof. Zhe He. Prior to this, I was a postdoc at the Bakar Computational Health Sciences Institute (BCHSI), University of California, San Francisco, under Dr. Vivek Rudrapatna in the Real-World Evidence Lab. I have a Ph.D. in Biomedical Informatics, earned under Prof. Jeyakumar Natarajan at the DRDO-BU Center for Life Sciences, Tamil Nadu, India. I am a proud member of the Data Mining and Text Mining Laboratory & Computational Biology Division @ DRDO-BU CLS. My work has concentrated on the automatic extraction of information from biomedical literature using machine learning, to bridge the gap between slow-growing, manually curated databases and the vast information already available as free text in published articles.

Research Interest

I am a researcher focusing on information extraction and machine learning in clinical data science and biomedicine.
I have 8+ years of experience in developing classification and prediction models.
I am interested in natural language processing, electronic health records, machine learning, and information extraction.
I am also exploring recent trends in large language models (LLMs), causal inference, literature-based discovery (LBD), network visualization, knowledge graphs (KGs), and explainable AI (XAI).

Research Groups

eHealth Lab at Florida State University (FSU), USA

The Real-World Evidence Lab, University of California, San Francisco (UCSF), USA

Data mining and Text mining Laboratory, Bharathiar University, India

Computational Biology division, DRDO-BU CLS, India

Experience

July 2023-Present Post Doctoral Fellow at the eHealth Lab in the School of Information (iSchool), College of Communication & Information at Florida State University, USA

August 2021-July 2023 Post Doctoral Fellow, The Bakar Computational Health Sciences Institute (BCHSI), University of California, San Francisco (UCSF), USA

December 2020-July 2021 Affiliate, The Bakar Computational Health Sciences Institute (BCHSI), University of California, San Francisco (UCSF), USA

September 2014-December 2020 Full-time Ph.D. research scholar in the field of Biomedical Literature Mining at Bharathiar University, Coimbatore, Tamil Nadu, India

July 2014-June 2018 Junior Research Fellow (JRF), Major Research Project (MRP) at DRDO-BU Center for Life Sciences, Coimbatore, Tamil Nadu, India

Academics


• Post Doctoral Fellow at the eHealth Lab in the School of Information (iSchool), College of Communication & Information at Florida State University, USA [2023-Present]
• Post Doctoral Fellow, The Bakar Computational Health Sciences Institute (BCHSI), University of California, San Francisco (UCSF), USA [2021-2023]
• Ph.D. in Computational Biology, Bharathiar University, Tamil Nadu, India [2014-2020]
• Master of Computer Applications, Mahatma Gandhi University, Kerala, India [2010-2013]
• B.Sc. in Computer Science, University of Kerala, Kerala, India [2007-2010]

Professional Affiliations

• Member of the Indian Society for Technical Education
• International Association of Engineers (IAENG) (ID: 179630)
• International Computer Science and Engineering Society (ICSES) (ID: 2795)
• International Economics Development Research Center (IEDRC) (ID: 90080866)
• Research Fellow, DRDO-BU Center for Life Sciences
• Ph.D. Scholar at Bharathiar University, India

Awards

a) Distant supervision for large scale extraction of gene-disease associations from literature using DeepDive, International Conference on Innovative Computing and Communication (ICICC-2018), Springer, Guru Nanak Institute of Management, West Punjabi Bagh, New Delhi (Best paper award, Data Mining)
b) Text Mining to Identify Gene-Gene Interactions of High Altitude Diseases, International Symposium on Computational Biology and Bioinformatics, Kerala University (2016) (Best poster award)
c) Text Mining and Network Analysis to Identify Genes Related to High Altitude Diseases, Fifth Edition of National Workshop on Computer Vision, Image Processing Techniques and Data Analytics, Amrita School of Engineering, Amrita Vishwa Vidyapeetham (2015) (Best poster award)

Certifications

• Microsoft Certified Professional (MCP) in Microsoft .NET Framework Application Development Foundation
• National Eligibility Test (UGC NET) in Computer Science
• Intro to Causal Inference
• Intro to Python for Data Science Course, DataCamp (Id: 3,014,861)
• Introduction to R Course, DataCamp (Id: 3,009,267)

Data Science and Machine Learning


• 8+ years of experience in text mining, natural language processing, machine learning applied to biomedical text data, information extraction and information retrieval
• Experience working with deep learning platforms such as TensorFlow, Keras, and PyTorch
• Excellent understanding of machine learning techniques and algorithms, such as CRF, SVM, graph kernels, and decision forests, with ensemble learning using MALLET, LibSVM, and EnsembleSVM
• Experience with Python libraries for text data analysis and machine learning, such as NLTK, spaCy, scikit-learn, and Word2Vec
• Experience with common data science toolkits, such as R, Weka, NumPy, and MATLAB
• Experience with data visualization tools, such as Cytoscape, Gephi, Matplotlib, Seaborn, Plotly, D3.js, and ggplot
• Proficiency in query languages such as SQL and PL/SQL
• Good applied statistics skills, such as distributions, statistical testing, and regression
• Good scripting and programming skills in Python, C++, Java, R, and C

Biomedical Domain


• Developed Named Entity Recognition Methodologies/Tools for Disease, Gene/Protein entities
• Developed Relation Extraction Methodologies/Tools for Gene-Disease Associations and Bio-molecular Events
• Developed a disease centric methodology by finding functional associations of genes in high altitude diseases
• Developed a data mining pipeline for finding potential regulatory sRNAs and their functional roles in Staphylococcus aureus
• Experience with the following NLP tasks: tokenization, part of speech tagging, morphological decomposition, chunking, segmentation, regular expressions using OpenNLP, Stanford CoreNLP, BioLemmatizer etc.
• Experience working with biomedical open-source ontologies and terminologies such as MeSH, UMLS, SNOMED, PharmGKB, OMIM etc.
• Experience with publication databases such as PubMed, PubMed Central, ClinicalTrials.gov etc.
• Experience working with open-source lexicons such as DBpedia and WordNet

Projects

1. Reducing diagnostic delays in Acute Hepatic Porphyria using electronic health records data and machine learning: a multicenter development and validation study
• Can machine learning be used to identify undiagnosed patients with Acute Hepatic Porphyria (AHP), a rare heritable condition?
• Train and evaluate models to separately predict referral and diagnosis

Using a cohort of recently seen patients at UCLA/UCSF, infer:
• Who is most likely to eventually be referred
• Who, among those, is likely to be diagnosed with AHP
• Sequentially enroll patients who meet both criteria to be tested

Code Repository: AHPPrediction

2. Accurate, Robust, and Scalable Machine Abstraction of Mayo Endoscopic Subscores from Colonoscopy Reports

Is natural language processing (NLP) a viable alternative to manually abstracting disease activity from notes?

Findings: We compared different methods for abstracting the ulcerative colitis Mayo endoscopic subscore from colonoscopy reports. Classifiers trained using automated machine learning (autoML) achieved the greatest accuracy (97%), recognized when to abstain, generalized well to other health systems, required limited effort for annotation and programming, demonstrated fairness, and had a small carbon footprint.
Meaning: NLP methods like autoML appear to be sufficiently mature technologies for clinical text classification, and thus are poised to enable many downstream endeavors using electronic health records data.

Code Repository: MayoClassifier

3. Improving adverse event detection related to biologic immunosuppressant use – a pilot study of the BERT deep learning model adapted to real-world clinical notes

The goals of this project were to:
1) develop a clinical language-specific adaptation of the BERT model by training it using 75 million machine-PHI-redacted clinical notes authored at the University of California, San Francisco (UCSF), and
2) adapt it to the task of serious adverse event detection from clinical notes. For this pilot study, we used outpatient clinical notes from the inflammatory bowel disease (IBD) clinic at UCSF and limited the scope of our model development to serious adverse events that resulted in a hospitalization and were associated with the use of biologic immunosuppressants for IBD. We compared our models against other publicly available and state-of-the-art systems for clinical language inference, both on general benchmarked tasks and on tasks pertaining specifically to adverse event detection.

Research Outcomes/Results
Using 75 million clinical notes authored at UCSF, we successfully trained a dedicated clinical language model from scratch, called UCSF BERT. We evaluated its performance on multiple publicly benchmarked tasks related to clinical language inference. These tasks include identifying important medical concepts from clinical notes as well as correctly characterizing the relationships between these concepts: for example, identifying drugs and diseases that are mentioned in notes, and recognizing relationships such as “Drug A is a treatment for disease B”. On these tasks, our model performs as well as or better than previously published models, including versions of BERT that have been trained on publicly available documents related to biomedical and clinical topics.
We subsequently customized this model to perform tasks specific to serious adverse event detection. These included identifying temporal relationships between medications of interest and subsequent hospitalizations (e.g. the patient was hospitalized after starting drug A) as well as identifying the stated reasons for given hospitalizations (e.g. the patient was hospitalized for acute pancreatitis). These customized BERT models were compared to previously published, state of the art models for adverse event detection. These models were trained and evaluated using a collection of gastroenterology clinical notes written about patients with inflammatory bowel disease, many of whom received biologic treatments for their disease. Our best performing models achieved accuracy ranging from 88-96% and a macro F1 score ranging from 62-68%. The latter is a more stringent metric that reflects the general rarity of serious adverse events (defined here as those associated with a hospitalization) in our collection of notes. These models performed reasonably well despite the significant length of these clinical notes, and generally outperformed state-of-the-art models by up to 10%. Overall, these results support the feasibility and the value of using advanced clinical language models to automate the detection of adverse events from electronic health records data. These models may someday support the surveillance of drug safety in the post-marketing setting using new data sources.
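The gap between accuracy and macro F1 reported above can be illustrated with a small sketch. The labels below are invented toy data (100 notes, 5 of which describe a serious adverse event), not results from the study, and the helper names are hypothetical. It only shows why macro F1 is the more stringent metric when the positive class is rare.

```python
def accuracy(y_true, y_pred):
    """Fraction of notes whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so the rare class counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, l) for l in labels) / len(labels)


# Toy data (assumption): 1 = serious adverse event, 0 = no event.
y_true = [1] * 5 + [0] * 95
# The hypothetical model finds 2 of the 5 events and raises 1 false alarm.
y_pred = [1] * 2 + [0] * 3 + [1] * 1 + [0] * 94

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, macro F1={mf1:.2f}")
```

Here the majority class dominates accuracy, while the missed events pull the rare-class F1, and therefore the macro average, down sharply.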
Research Impacts
Facilitation of strategic relationships with expert groups and stakeholders: This project has established new relationships between the FDA and our university, comprising stakeholders that include clinicians, pharmacists, researchers, software engineers, and patients. Our team is currently in the process of developing a follow-on grant that involves both FDA and UCSF stakeholders, to continue the development of this method.
Scientific publications/citations in literature: We are in the process of submitting one paper describing the UCSF BERT model and are preparing a second (and possibly third) manuscript to report the results of the serious AE detection customizations of our BERT model.
Presentations at conferences/meetings: Our work has been accepted for presentation at the American Medical Informatics Association 2022 annual summit, and we anticipate additional conference presentations. We are also in the process of preparing an education session for the Drug Information Association meeting in 2023.


Code Repository: IBD_ADE

4. Patient trajectories and risk prediction in patients with nonalcoholic steatohepatitis (NASH)
Primary:
In an incident cohort with NASH receiving the standard of care, predict the time to each of five major clinical outcomes (Compensated Cirrhosis, Decompensated Cirrhosis, Liver Transplant, Hepatocellular Carcinoma, Death)

Secondary:
Identify (modifiable) risk factors that predict progression towards these outcomes
Estimate the average annual healthcare utilization and costs associated with NASH-related outcomes

Exploratory:
Time from the diagnosis of a patient’s first metabolic risk factor to a liver function test (LFT) or liver imaging order
Time to cardiovascular disease-associated mortality

Code Repository: NASHDetection


5. D-NER: Disease Name Recognizer from literature using stacked ensemble and fuzzy matching
D-NER is built on a hybrid methodology that combines orthographic, typographic, and semantic-trigger features in an ensemble machine learning model with fuzzy matching against an in-house disease dictionary.
[Publication: here]
Availability:
[Java, JSP, JavaScript and Apache]
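As a rough illustration of the dictionary fuzzy-matching idea only (not the D-NER implementation, which is Java-based and also relies on ensemble learning), the sketch below uses Python's standard `difflib`. The dictionary entries, the `best_fuzzy_match` helper, and the 0.85 threshold are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for an in-house disease dictionary.
DISEASE_DICTIONARY = [
    "ulcerative colitis",
    "acute hepatic porphyria",
    "hepatocellular carcinoma",
]


def normalize(text):
    """Lowercase and collapse whitespace before comparing strings."""
    return " ".join(text.lower().split())


def best_fuzzy_match(mention, dictionary, threshold=0.85):
    """Return (term, score) for the closest dictionary term.

    The term is None when no entry reaches the similarity threshold.
    """
    mention_norm = normalize(mention)
    best_term, best_score = None, 0.0
    for term in dictionary:
        score = SequenceMatcher(None, mention_norm, normalize(term)).ratio()
        if score > best_score:
            best_term, best_score = term, score
    if best_score < threshold:
        return None, best_score
    return best_term, best_score


# A misspelled mention still maps to its dictionary entry.
print(best_fuzzy_match("Ulcerative collitis", DISEASE_DICTIONARY))
# An unrelated mention falls below the threshold and is rejected.
print(best_fuzzy_match("migraine", DISEASE_DICTIONARY))
```

A threshold like this trades recall for precision: lowering it catches more spelling variants but also admits more spurious matches.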

6. DisGeReEXT: Knowledge discovery by analysis of gene disease relations from Medline
DisGeReEXT presents a literature-wide analysis study (LWAS) that extracts both direct and indirect gene-disease associations, using joint ensemble learning (explicit) together with concept profiling based on the ABC principle (implicit), to prioritize and rationalize potentially informative discoveries about the genetic role in diseases. The server also ranks informative sentences using a scoring model and calculates disease-disease similarity from functional associations among shared genes.
[Publication: here]
Availability:
[Python, JavaScript, Cytoscape Web and Apache]
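The ABC principle mentioned above (if A is linked to B and B is linked to C, then an unreported A-C link is a candidate discovery) can be sketched as follows. This is a simplified illustration, not the DisGeReEXT pipeline: the `abc_candidates` helper and the placeholder entity names are hypothetical, and candidates are ranked simply by the number of shared intermediate concepts.

```python
from collections import defaultdict


def abc_candidates(direct_links, start):
    """Find indirect (A-C) candidates for `start` via shared B concepts.

    direct_links: iterable of (x, y) pairs reported together in the literature.
    Returns a list of (candidate, set_of_intermediates), most support first.
    """
    neighbors = defaultdict(set)
    for x, y in direct_links:
        neighbors[x].add(y)
        neighbors[y].add(x)

    candidates = defaultdict(set)
    for b in neighbors[start]:
        for c in neighbors[b]:
            # Keep only C terms that are not already directly linked to A.
            if c != start and c not in neighbors[start]:
                candidates[c].add(b)
    return sorted(candidates.items(), key=lambda kv: (-len(kv[1]), kv[0]))


# Placeholder entities (assumption), not real biological findings.
links = [
    ("gene_A", "process_1"),
    ("gene_A", "process_2"),
    ("process_1", "disease_X"),
    ("process_2", "disease_X"),
    ("process_2", "disease_Y"),
    ("gene_A", "disease_Z"),  # already a direct association, so not a candidate
]

for disease, via in abc_candidates(links, "gene_A"):
    print(disease, "via", sorted(via))
```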

7. Automatic extraction of gene-disease associations from literature using joint ensemble learning: GDMiner
This work employed a supervised machine learning approach in which a rich feature set covering conceptual, syntactic and semantic properties, learned jointly with word embeddings, is used to train an ensemble support vector machine for extracting gene-disease relations from four gold standard corpora. In evaluation, the approach showed promising results, with F-measures of 85.34%, 83.93%, 87.39% and 85.57% on the EUADR, GAD, CoMAGC and PolySearch corpora respectively.
[Publication: here]
Availability:

[Word2Vec, EnsembleSVM]

8. BCC-NER: Bidirectional, Contextual Clues Named Entity Tagger for Gene/Protein Mention Recognition
BCC-NER is deployed with three modules. The first module handles text processing, which includes basic NLP pre-processing, feature extraction and feature selection. The second module handles training and model building with bidirectional conditional random fields (CRF), parsing the text in both directions (forward and backward) and integrating the backward- and forward-trained models using the margin-infused relaxed algorithm (MIRA). The third and final module handles post-processing to improve performance, which includes surrounding text features, parenthesis mismatching and a two-tier abbreviation algorithm.
[Publication: here]
Availability:
[Java, JSP, Java Script and Apache ]

9. Text mining and network analysis to find functional associations of genes in high altitude diseases
In this work we identified the gene networks responsible for high altitude diseases by using the principle of gene co-occurrence statistics from literature and network analysis. First, we mined the literature data from PubMed on high-altitude diseases and extracted the co-occurring gene pairs. Next, gene pairs were ranked based on their co-occurrence frequency. Finally, a gene association network was created using statistical measures to explore potential relationships. Network analysis results revealed that EPO, ACE, IL6 and TNF are among the top five genes found to co-occur with 20 or more genes, while the association between the EPAS1 and EGLN1 genes is strongly substantiated.
[Publication: here]
Availability:

[Java, Cytoscape, Clustering]
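The co-occurrence steps described above (extract co-occurring gene pairs, rank them by frequency, then count how many partners each gene has) can be sketched in a few lines. The per-abstract gene lists are invented toy input and the function names are hypothetical. The sketch uses raw counts only and does not reproduce the statistical measures or the Java and Cytoscape tooling of the original work.

```python
from collections import Counter, defaultdict
from itertools import combinations


def rank_gene_pairs(abstract_gene_mentions):
    """Count how many abstracts mention each unordered gene pair."""
    pair_counts = Counter()
    for genes in abstract_gene_mentions:
        # Deduplicate within an abstract and sort so (A, B) == (B, A).
        for pair in combinations(sorted(set(genes)), 2):
            pair_counts[pair] += 1
    return pair_counts.most_common()


def gene_degrees(ranked_pairs, min_count=1):
    """Number of distinct co-occurring partners per gene."""
    partners = defaultdict(set)
    for (gene_a, gene_b), count in ranked_pairs:
        if count >= min_count:
            partners[gene_a].add(gene_b)
            partners[gene_b].add(gene_a)
    return {gene: len(others) for gene, others in partners.items()}


# Toy input (assumption): genes recognized in each of five abstracts.
abstracts = [
    ["EPAS1", "EGLN1"],
    ["EPAS1", "EGLN1", "EPO"],
    ["EGLN1", "EPAS1"],
    ["EPO", "ACE"],
    ["IL6", "TNF", "EPO"],
]

ranked = rank_gene_pairs(abstracts)
degrees = gene_degrees(ranked)
print(ranked[0])
print(sorted(degrees.items(), key=lambda kv: -kv[1]))
```

In this toy input the most frequent pair and the best-connected gene differ, which is why the entry reports both pair strength and hub genes.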

10. Genomic analysis of RNA-Seq and sRNA-Seq data identifies potential regulatory sRNAs and their functional roles in Staphylococcus aureus
In-silico analysis of coding and non-coding RNA-Seq data to explore their functional association in resistant S. aureus. A global network was generated using opposite-expression and co-expression analysis to explore potential associations. 9 sRNAs and 10 genes were identified as potential hubs. The critical role of the non-coding RNA sRNA95 as a drug target was suggested.
[Publication: here]
Availability:
[whole-genome sRNA-gene network of Staphylococcus aureus]

11. Multiscale Laplacian graph kernel combined with lexico-syntactic patterns for biomedical event extraction from literature
This work employed a hybrid approach that integrates an ensemble-learning framework by combining a Multiscale Laplacian Graph kernel and a feature-based linear kernel, using a pattern-matching engine to identify biomedical events with arguments. This graph-based kernel not only captures the topological relationships between the individual event nodes but also identifies the associations among the subgraphs for complex events. In addition, the lexico-syntactic patterns were used to automatically discover the semantic role of each word in the sentence.
[Publication: here]
Availability:

[Multiscale Laplacian Graph kernel, Bio Event]


12. The potential role of procyanidin as a therapeutic agent against SARS-CoV-2: a text mining, molecular docking and molecular dynamics simulation approach
This study used text mining and named entity recognition to identify co-occurrences of the important COVID-19 genes/proteins in the interaction network, based on the frequency of interaction. Network analysis revealed a set of genes/proteins, highly dense gene/protein clusters, and sub-networks of angiotensin-converting enzyme 2 (ACE2), helicase, spike (S) protein (trimeric), membrane (M) protein, envelope (E) protein, and nucleocapsid (N) protein. The isolated proteins were screened against procyanidin, a flavonoid from plants, using molecular docking. Further, molecular dynamics simulations of critical proteins such as ACE2, Mpro and spike proteins were performed to elucidate the inhibition mechanism.
[Publication: here]
Availability:
[COVID 19, Text Mining, Procyanidin, Flavonoids]

13. Molecular Mechanism of T-2 Toxin-Induced Cerebral Edema by Aquaporin-4 Blocking and Permeation
The present study aimed to reveal the molecular mechanism of T-2 toxin-induced cerebral edema by aquaporin-4 (AQP4) blocking and permeation. AQP4 is a class of aquaporin channels that is mainly expressed in the brain, and its structural changes lead to life-threatening complications such as cardio-respiratory arrest, nephritis, and irreversible brain damage. We employed molecular dynamics simulation, text mining, and in vitro and in vivo analysis to study the structural and functional changes induced by the T-2 toxin on AQP4.
[Publication: here]
Availability:
[T-2 toxin-induced cerebral edema, Aquaporin-4 (AQP4)]