I've recently begun learning about machine learning, in particular neural networks, logistic regression, etc. In an effort to identify significant SNPs associated with my own ailments (in particular essential tremor and generalized anxiety), for which no conclusive links exist, I was wondering about a potential large-scale crossover between machine learning and genetics. In my opinion, the first step toward a cure for anything, even aging, is being able to diagnose the disease correctly. It seems many ailments other than the standard, well-studied ones (cancer, diabetes, etc.) have no known cause. Essential tremor, for example, is believed to be hereditary, yet I'm not aware of any studies showing SNPs of statistical significance that might be the cause. In fact, I'd say that for the majority of diseases, we don't know the specific combinations of SNPs that may cause them.
Biology is the most complicated problem mankind has faced to date. I don't believe we can identify the disease-causing SNPs or other mutations (nonsense mutations, insertions, deletions, etc.) for all diseases through laboratory work, guesswork, or clinical observation alone. We must somehow employ machine learning to help solve this problem. Once we know the mutations responsible for most diseases, we will then know what to target specifically for a cure via gene therapy or whatever other technology becomes available.
I therefore have a question, and I apologize in advance if it is simplistic, as I am new to this area of study. It is hypothetical, since I doubt the computational power to do this yet exists, but assuming computational power and time were sufficient (massive parallel processing in the cloud, or something of that sort), would the following project lead to any useful results?
Basically the idea is this: train a neural network with n features, where the n features are all known SNPs and other mutations that can somehow be represented categorically. That is most likely 10,000,000+ features, and the number will grow over time. Then let m be the number of samples. Say we have 100,000 people who have some disease X and another 100,000 people who don't, as controls. We randomly shuffle the two groups together into a single dataset of m = 200,000 samples. We label the output as 0 or 1 (0 = disease, 1 = no disease). We then divide this dataset according to the usual convention (say 60% of m as the training set, 20% as a cross-validation set, and 20% as a test set), as in the sketch below.
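To make the setup concrete, here is a minimal sketch of the data preparation using NumPy and scikit-learn. The genotype matrix is random placeholder data, the 0/1/2 minor-allele-count encoding is just one common choice (an assumption on my part, since the question only says "represented categorically somehow"), and the sample and feature counts are scaled well down from the 100,000 + 100,000 people and 10,000,000+ SNPs described above so the toy version actually runs.

```python
# Toy version of the dataset construction and 60/20/20 split described above.
# All numbers and the genotype encoding are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n_features = 1_000            # stand-in for the ~10,000,000+ SNPs/mutations
n_cases, n_controls = 10_000, 10_000   # question proposes 100,000 each

# Genotypes encoded as 0/1/2 copies of the minor allele (placeholder random data).
X = rng.integers(0, 3, size=(n_cases + n_controls, n_features), dtype=np.int8)

# Labels follow the convention in the question: 0 = disease, 1 = no disease.
y = np.concatenate([np.zeros(n_cases, dtype=np.int8),
                    np.ones(n_controls, dtype=np.int8)])

# Shuffle and split 60/20/20 into training / cross-validation / test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, train_size=0.6, shuffle=True, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0)
```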
Now the question. After we have trained the neural network and chosen appropriate regularization, hidden layers, etc., would you expect any statistically significant prediction on the 20% test set? And if so, could we then go on to identify the features (SNPs), perhaps via unsupervised clustering, that are statistically significant in terms of being a cause of this disease X?
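For what it's worth, here is how I imagine the training and evaluation step might look, continuing from the split above. The network architecture, the L2 (alpha) regularization value, and the use of permutation importance to rank features are my own illustrative choices, not anything established; the question mentions unsupervised clustering, and permutation importance is simply a more direct alternative for asking which features the model relies on.

```python
# Hedged sketch: train a small neural network, evaluate on the held-out test
# set, and rank features by permutation importance. All settings are assumptions.
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score
from sklearn.inspection import permutation_importance

clf = MLPClassifier(hidden_layer_sizes=(64, 32),  # "hidden layers etc."
                    alpha=1e-3,                   # L2 regularization strength
                    max_iter=200,
                    random_state=0)
clf.fit(X_train, y_train)

# Predictive performance on the 20% test set; with purely random genotypes the
# AUC should hover around 0.5, which is the null to test significance against
# (e.g. via a permutation test on the labels).
test_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"test AUC: {test_auc:.3f}")

# One way to ask which SNPs drive the prediction: shuffle each feature column
# on the test set and measure how much performance drops.
imp = permutation_importance(clf, X_test, y_test, n_repeats=5, random_state=0)
top = imp.importances_mean.argsort()[::-1][:20]
print("top candidate feature columns:", top)
```

With real genotype data the model selection (hidden layer sizes, alpha) would be done on the 20% cross-validation set, keeping the test set untouched until the very end.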