r - How can I write a loop to count the genes for each query id?


I have a data frame in R with 14 columns and 4.4 million rows.

Column 1 holds the query id and column 4 holds the gene name.

I want to build a data frame that shows how many genes correspond to each query id.

I have 44k different query ids, and each query has at most ~100 gene hits.

csai_contig04661_6  sp  o65396  gcst   arath  86.03  408  56   1   72       478  1    408  0.0e+00  738.0
csai_contig04661_6  sp  q681y3  y1099  arath  22.55  337  244  10  140      474  103  424  8.0e-09  56.6
csai_contig04661_6  sp  q9flr5  smc6a  arath  24.27  103  66   3   04. jun  249  342  441  4.6e+00  28. sep
csai_contig04661_6  sp  q9lqi7  gcst   arath  24.28  74   47   2   17. aug  300  31   100  8.1e+00  27. jul
csai_contig04661_6  sp  p56795  rk22   arath  28.95  76   49   4   11. mrz  509  15   87   8.4e+00  27. mrz
csai_isotig00001_4  sp  q8vze4  pp299  arath  29.63  108  55   5   31. jul  307  10   109  1.6e+00  30. apr

I am interested in this type of output:

 csai_contig04661_6       gcst       2      y1099      1       smc6a      1       rk22       1 

How can I write a loop that scans column 1 for as long as the query id stays the same (the example above has 6 rows), then goes to column 4, finds which genes are present, and counts how many times each one occurs (in the example, gcst appears 2 times for the first query)?

You can accomplish this with dplyr, without writing a loop:

    library(dplyr)

    # v1 = query id, v4 = gene name
    group_by(df, v1, v4) %>%
      summarise(n = n()) %>%
      group_by(v1) %>%
      summarise(hits = paste(paste(v4, n), collapse = " "))
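
If dplyr is not available, the same two steps (count per query/gene pair, then collapse per query) can be sketched in base R with aggregate(). This is only an illustration on a toy data frame built from the query ids and gene names in the sample above. The column names v1 and v4 are taken from the answer and are an assumption about how the real data frame is named.

```r
# Toy data: only the query id and gene name columns from the sample.
# The names v1 and v4 are assumed; adjust them to the real data frame.
df <- data.frame(
  v1 = c(rep("csai_contig04661_6", 5), "csai_isotig00001_4"),
  v4 = c("gcst", "y1099", "smc6a", "gcst", "rk22", "pp299"),
  stringsAsFactors = FALSE
)

# Step 1: count the rows for each (query id, gene) pair.
counts <- aggregate(n ~ v1 + v4, data = transform(df, n = 1L), FUN = sum)

# Step 2: paste each gene with its count, then collapse per query id.
counts$pair <- paste(counts$v4, counts$n)
hits <- aggregate(pair ~ v1, data = counts, FUN = paste, collapse = " ")
names(hits)[2] <- "hits"

print(hits)
```

On the toy data this gives one row per query id, with gcst counted twice for csai_contig04661_6. Genes come out in alphabetical order rather than in order of first appearance, because aggregate() sorts its groups.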
