Since this blog is mainly about what I do here at the university, I am adding a separate post about the work itself.
One of the things we have done so far is apply a machine learning technique called Random Forest to the data in order to create a model. The model is trained on a training set so that it learns to recognize the difference between healthy and sick samples. Once the model has been trained, we can test it with a test set and validate it with a validation set. By doing this we can track down which genes matter most to the model when it decides whether a sample is classified as sick or healthy. This could potentially be very interesting: we can check the literature to see whether these genes are connected to a certain disease, in this case liver disease. If they are, perfect: the model is relying on genes that are known to be relevant, so such a gene could possibly be used as a biomarker in metagenomic data. If nothing can be found linking a gene to liver disease, and yet the calculations show a significant difference between healthy and sick samples for that gene, then who knows, it could potentially be a new biomarker for this disease.
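To make the idea a bit more concrete, here is a minimal sketch of that workflow in Python. It assumes scikit-learn and NumPy, which is my choice for the illustration and not necessarily what we use in the project. The data is synthetic, and the names `rank_features` and `feature_0` to `feature_19` are made up for the example. A separate validation set would be scored in the same way as the test set here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def rank_features(X, y, feature_names, seed=0):
    """Train a random forest; return (test accuracy, features ranked by importance)."""
    # Hold out 30% of the samples for testing; stratify keeps the class balance.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    # feature_importances_ holds one non-negative weight per column.
    importances = model.feature_importances_
    order = np.argsort(importances)[::-1]  # most important first
    ranking = [(feature_names[i], float(importances[i])) for i in order]
    return accuracy, ranking


# Synthetic stand-in data: 100 samples x 20 features, 0 = healthy, 1 = sick.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 20))
X[y == 1, 0] += 3.0  # only feature_0 really differs between the groups
names = [f"feature_{i}" for i in range(20)]

accuracy, ranking = rank_features(X, y, names)
print(f"test accuracy: {accuracy:.2f}")
for name, weight in ranking[:5]:
    print(name, round(weight, 3))
```

In this toy setup only `feature_0` carries any signal, so it should end up at the top of the ranking. With real data the ranking is only a starting point for the literature check described above.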
So far the accuracy of the model is far from perfect, and the validation set turned out to be of too poor quality to use for validation, so there is a lot more work to do. Looking for these genes in the literature also proved to be far from easy, since what we collect are not actual gene names but K numbers. These need to be linked via KO assignment and KEGG mapping before we can find matches in the literature.
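As a rough illustration of that lookup step, the sketch below pairs K numbers with descriptions taken from a tab-separated table. The table format and the function names `parse_ko_table` and `annotate` are assumptions made for this example and not our actual pipeline. The single table entry is only there for illustration, and `K99999` is a made-up identifier standing in for a K number that has no entry.

```python
def parse_ko_table(text):
    """Parse lines of the form 'K number<TAB>description' into a dict."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        k_number, description = line.split("\t", 1)
        table[k_number.strip()] = description.strip()
    return table


def annotate(k_numbers, table):
    """Pair each K number with its description, or None if it is unmapped."""
    return [(k, table.get(k)) for k in k_numbers]


# Illustrative one-line table; a real table would be much larger.
example_table = parse_ko_table("K00001\talcohol dehydrogenase\n")

# K99999 is a hypothetical K number with no entry in the table.
for k_number, description in annotate(["K00001", "K99999"], example_table):
    print(k_number, description or "no mapping found")
```

The descriptions that come out of a lookup like this are what can then be used as search terms in the literature, while the unmapped K numbers need to be checked by hand.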
So far I am happy with the progress I am making. I feel much better and more confident about what I did today than I did a few weeks ago, which is the main purpose of being here in the first place. I get great support from the team and especially from my supervisor Maja Kuzman, so you won’t hear me complain. I think that’s about the most important thing I can say about what I do without going too deep into the details. I hope you find it as interesting as I do.