R - Extracting columns from a text file


I load a text file (tree.txt) into R, with the content below (copy-pasted from the JWeka - J48 command). I use the following command to load the text file:

data3 <- read.table(file.choose(), header = FALSE, sep = ",")

I would like to insert each column into separate variables named in the following format: col1, col2 ... col8 (in this example, since we have 8 columns). If you load the file into Excel with delimited separation, each row is split so that every field lands in its own column (this is the required result). Each colN should contain the relevant characters of the tree in this example. How can I separate the text file and insert it into these columns automatically, while ignoring the header and footer content of the file?

Here is the text file content:

[[1]]
j48 pruned  tree
------------------

mstv    <=  0.4
|   mltv    <=  4.1:    3   -2
|   mltv    >   4.1
|   |   astv    <=  79
|   |   |   b   <=  1383:00:00  2   -18
|   |   |   b   >   1383
|   |   |   |   uc  <=  05:00   1   -2
|   |   |   |   uc  >   05:00   2   -2
|   |   astv    >   79:00:00    3   -2
mstv    >   0.4
|   dp  <=  0
|   |   altv    <=  09:00   1   (170.0/2.0)
|   |   altv    >   9
|   |   |   fm  <=  7
|   |   |   |   lbe <=  142:00:00   1   (27.0/1.0)
|   |   |   |   lbe >   142
|   |   |   |   |   ac  <=  2
|   |   |   |   |   |   e   <=  1058:00:00  1   -5
|   |   |   |   |   |   e   >   1058
|   |   |   |   |   |   |   dl  <=  04:00   2   (9.0/1.0)
|   |   |   |   |   |   |   dl  >   04:00   1   -2
|   |   |   |   |   ac  >   02:00   1   -3
|   |   |   fm  >   07:00   2   -2
|   dp  >   0
|   |   dp  <=  1
|   |   |   uc  <=  03:00   2   (4.0/1.0)
|   |   |   uc  >   3
|   |   |   |   mltv    <=  0.4:    3   -2
|   |   |   |   mltv    >   0.4:    1   -8
|   |   dp  >   01:00   3   -8

number  of  leaves  :   16

size    of  tree    :   31

An example of the col1 content would be: mstv | | | | | | | | mstv | | | | | | | | | | | | | | | | | | | |

The col2 content would be: mltv mltv | | | | | | > dp | | | | | | | | | | | | dp | | | | | |

Try this:

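# Drop the 4 header and 4 footer lines, keeping only the tree body
cleaned.txt <- capture.output(cat(paste0(tail(head(readLines("file_location"), -4), -4), collapse = '\n'), sep = '\n'))
# Parse the body as fixed-width fields of 4 characters each
cleaned.df <- read.fwf(file = textConnection(cleaned.txt),
                       header = FALSE,
                       widths = rep.int(4, max(nchar(cleaned.txt)/4)),
                       strip.white = TRUE
                       )
# Keep only the columns that are not entirely NA
cleaned.df <- cleaned.df[, colSums(is.na(cleaned.df)) < nrow(cleaned.df)]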

For the cleaning process, I ended up using a combination of head and tail to remove the 4 lines at the top and the 4 lines at the bottom. There's probably a more efficient way to do this outside of R, but this isn't so bad. Generally, I'm just making the file readable to R.
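As a minimal sketch of how the negative-n trimming behaves, here it is on a made-up vector of lines (the sample content is an assumption standing in for readLines("file_location"), not your actual file):

```r
# Hypothetical stand-in for the file: 4 header lines, 2 body lines, 4 footer lines.
lines <- c("[[1]]", "J48 pruned tree", "------------------", "",
           "mstv <= 0.4", "mstv > 0.4",
           "", "Number of Leaves : 16", "", "Size of the tree : 31")

# head(x, -4) drops the last 4 elements; tail(x, -4) drops the first 4.
body <- tail(head(lines, -4), -4)
print(body)
```

If your header or footer has a different number of lines, the two -4 values would need to change accordingly.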

Your file looks like a fixed-width file, so I use read.fwf, and use textConnection() to point the function at the cleaned output.
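To illustrate the idea on something small, here is a sketch that reads two made-up 12-character lines as three 4-character fields. The sample lines and the explicit stringsAsFactors = FALSE are my assumptions for the illustration:

```r
# Two hypothetical fixed-width lines, three fields of 4 characters each.
txt <- c("mstv<=  0.4 ",
         "|   mltv<=  ")

# textConnection() lets read.fwf treat the character vector like a file.
con <- textConnection(txt)
df <- read.fwf(con,
               widths = rep.int(4, 3),
               header = FALSE,
               strip.white = TRUE,
               stringsAsFactors = FALSE)
close(con)

print(df)
```

Each 4-character slice becomes one column (V1, V2, V3), with the padding stripped by strip.white.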

Finally, I'm not sure how your data is actually structured, but when I copied it from Stack Overflow, it pasted with a bunch of whitespace at the end of each line. I'm using some tricks to guess how wide the file is, and removing the extraneous columns, here:

widths = rep.int(4, max(nchar(cleaned.txt)/4))
cleaned.df <- cleaned.df[, colSums(is.na(cleaned.df)) < nrow(cleaned.df)]
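A rough sketch of what those two lines do, using made-up inputs (the sample lines and the hand-built data frame are assumptions, standing in for the real parsed file):

```r
# Hypothetical cleaned lines; the first one is padded to 20 characters.
sample.txt <- c(sprintf("%-20s", "mstv<=  0.4"), "|   mltv<=")

# One 4-character field for every 4 characters of the longest line.
widths <- rep.int(4, max(nchar(sample.txt) / 4))
print(widths)

# Hand-built stand-in for a parsed result in which V3 is entirely NA.
sample.df <- data.frame(V1 = c("mstv", "|"),
                        V2 = c("<=", "mltv"),
                        V3 = c(NA, NA),
                        stringsAsFactors = FALSE)

# A column is kept only if it has fewer NAs than there are rows.
keep <- colSums(is.na(sample.df)) < nrow(sample.df)
sample.df <- sample.df[, keep]
print(names(sample.df))
```

The trailing whitespace inflates the width guess, and the NA filter then throws away the columns that turned out to be empty.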

Next, I'm creating the data in the way you would like it structured.

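for (i in colnames(cleaned.df)) {
  # Put the column into a variable named after its header
  assign(i, subset(cleaned.df, select = i))
  # Collapse its non-blank values into one space-separated string
  assign(i, capture.output(cat(paste0(unlist(get(i)[get(i) != ""])), sep = ' ', fill = FALSE)))
}

rm(i)
rm(cleaned.df)
rm(cleaned.txt)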

What this does is create a loop over each column header in your data frame.

From there, it uses assign() to put the data in each column into its own data frame. In your case, they are named V1 through V15.

Next, it uses a combination of cat() and paste() with unlist() and capture.output() to concatenate each list into a single character vector, so that each of your data frames becomes a character vector instead of a data frame.

Keep in mind that because you wanted a space between each character, I'm using a space as the separator. But because this is a fixed-width file, some columns are blank, which I'm removing using

get(i)[get(i)!=""] 

(Your question said you wanted col2 to be: mltv mltv | | | | | | > dp | | | | | | | | | | | | dp | | | | | |).

If you just use get(i), there will be leading whitespace in the output.
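If you would rather have names like col1, col2, ... and avoid assign()/get(), one possible alternative is to collect the strings in a named list. This is only a sketch of a different technique, not the loop described here; the small data frame is made up for illustration:

```r
# Hypothetical parsed data frame; blank cells are empty strings.
tree.df <- data.frame(V1 = c("mstv", "|", "|"),
                      V2 = c("", "mltv", "mltv"),
                      stringsAsFactors = FALSE)

# For each column: drop NA and blank cells, then join the rest with spaces.
cols <- lapply(tree.df, function(x) {
  x <- x[!is.na(x) & x != ""]
  paste(x, collapse = " ")
})

# Rename V1, V2, ... to col1, col2, ...
names(cols) <- paste0("col", seq_along(cols))

print(cols$col1)
print(cols$col2)
```

Keeping the results in one list means you can access them as cols$col1, cols$col2 and so on, without creating a separate variable for every column.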

