Comments
I think the VCF would tell you if you had it. Another possibility would be using a lower quality threshold for calling SNPs, but that seems unlikely.
Thanks for the explanation and tips! I used your procedure and ended up with the same 131MB file. Interestingly, I did not need to remove the "--" entries. I have been exchanging email with BGI, and they indicated that files could have significantly different numbers of entries (but I am surprised at >3x!). Is there any chance your sequencing had greater than 4x coverage? My VCF file is queued up and should be available in a few months, which should help clarify what I am seeing.
I think Promethease (http://promethease.com) is a good and inexpensive ($5) start. If you have both sets of results, I would recommend using the 23andMe data, given my experience with uploading BGI data. A web search for "promethease review" will turn up some details and alternatives. Hopefully those of us in the BGI study can work out a good way of analyzing that data.
I received a similar email and was able to download my genome file a few days ago. The file is in 23andMe format, output by Plink. It was plain text even though it had a .gz suffix. I had trouble uploading the file to Promethease, but was able to get it working by changing the header to one copied from an actual 23andMe file and removing the missing (--) SNPs. Unfortunately, despite being ~125MB (~5x the size of an example 23andMe file I have), my file is missing many of the 23andMe SNPs (7948 genotypes annotated in Promethease vs. 20k+ for the 23andMe example). I have an email in to BGI requesting additional information. For example, Promethease directly supports the dbSNPAnnotated.bz2 Complete Genomics file, and I was hoping to get a copy of that file for my data.
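For anyone who wants to try the same cleanup, here is a minimal Python sketch of the two steps described above (swap in a header from a real 23andMe file, drop the "--" genotypes). It assumes the usual 23andMe raw-data layout: "#" comment lines followed by tab-separated rsid, chromosome, position, genotype columns. The function names `open_text` and `clean_genotype_lines` and the file names in the usage note are hypothetical, made up for illustration, and this is not necessarily what BGI or Promethease expect in every case.

```python
import gzip


def open_text(path):
    """Open a genotype file as text.

    The download had a .gz suffix but was plain text, so check the
    gzip magic bytes instead of trusting the file name.
    """
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":  # real gzip data
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def clean_genotype_lines(lines, header_lines, missing="--"):
    """Yield the replacement header, then only the called genotypes.

    Assumption: data rows are tab-separated with the genotype in the
    fourth column (rsid, chromosome, position, genotype).
    """
    # Emit the header copied from an actual 23andMe file.
    for header in header_lines:
        yield header.rstrip("\n") + "\n"
    for line in lines:
        # Skip the original header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Skip malformed rows and rows with a missing genotype.
        if len(fields) < 4 or fields[3] == missing:
            continue
        yield "\t".join(fields[:4]) + "\n"
```

Usage would look something like this, with `bgi_genome.gz`, `real_23andme.txt`, and `bgi_cleaned.txt` standing in for your own files: read the "#" lines from `real_23andme.txt` into a list, then write `clean_genotype_lines(open_text("bgi_genome.gz"), header_lines)` to `bgi_cleaned.txt` with `writelines` and upload that file.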
Have you had any success analyzing your results? Would anyone be interested in starting a discussion group for analyzing our BGI results?