R - How can I write a loop to count genes per query ID?
I have a data frame in R with 14 columns and 4.4 million rows.
Column 1 holds the query ID and column 4 holds the gene name.
I want to build a data frame that shows how many times each gene occurs for each query ID.
There are 44k distinct query IDs, and each query has at most ~100 gene hits.
csai_contig04661_6 sp o65396 gcst arath 86.03 408 56 1 72 478 1 408 0.0e+00 738.0
csai_contig04661_6 sp q681y3 y1099 arath 22.55 337 244 10 140 474 103 424 8.0e-09 56.6
csai_contig04661_6 sp q9flr5 smc6a arath 24.27 103 66 3 04. jun 249 342 441 4.6e+00 28. sep
csai_contig04661_6 sp q9lqi7 gcst arath 24.28 74 47 2 17. aug 300 31 100 8.1e+00 27. jul
csai_contig04661_6 sp p56795 rk22 arath 28.95 76 49 4 11. mrz 509 15 87 8.4e+00 27. mrz
csai_isotig00001_4 sp q8vze4 pp299 arath 29.63 108 55 5 31. jul 307 10 109 1.6e+00 30. apr
I am interested in this type of output:
csai_contig04661_6 gcst 2 y1099 1 smc6a 1 rk22 1
How can I write a loop that walks down column 1 for as long as the query ID stays the same (6 in the example), then goes to column 4, finds which genes are present, and counts how often each one occurs when it appears more than once (in the example, gcst is present 2 times for the first query)?
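You can accomplish this with dplyr, without an explicit loop (here df is the data frame, and v1 and v4 are the names of columns 1 and 4):
library(dplyr)
group_by(df, v1, v4) %>% summarise(n = n()) %>% group_by(v1) %>% summarise(hits = paste(paste(v4, n), collapse = " "))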
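If dplyr is not available, the same grouping idea can be sketched in base R. This is only a minimal sketch under stated assumptions: count_genes is a hypothetical helper name, the toy df holds just the two relevant columns from the sample rows in the question, and the column names v1 and v4 follow the dplyr line above. Unlike the dplyr pipeline, it lists genes in the order they first appear for each query, which matches the output format asked for in the question.

```r
# Toy data built from the six sample rows (only query id and gene name kept).
df <- data.frame(
  v1 = c(rep("csai_contig04661_6", 5), "csai_isotig00001_4"),
  v4 = c("gcst", "y1099", "smc6a", "gcst", "rk22", "pp299"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per query id, with "gene count" pairs pasted
# into a single string. Column names are assumptions and can be overridden.
count_genes <- function(df, query_col = "v1", gene_col = "v4") {
  # Factor levels in first-seen order keep queries in their original order.
  queries <- factor(df[[query_col]], levels = unique(df[[query_col]]))
  per_query <- split(df[[gene_col]], queries)

  hits <- vapply(per_query, function(genes) {
    # Tabulate genes, again in order of first appearance within the query.
    tab <- table(factor(genes, levels = unique(genes)))
    paste(names(tab), as.integer(tab), collapse = " ")
  }, character(1))

  data.frame(query = names(hits), hits = unname(hits),
             stringsAsFactors = FALSE)
}

res <- count_genes(df)
print(res)
# hits for csai_contig04661_6: "gcst 2 y1099 1 smc6a 1 rk22 1"
# hits for csai_isotig00001_4: "pp299 1"
```

On the full 4.4 million rows this makes one pass to split the gene column and then one small table() call per query, so no row-by-row loop is needed.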