python - Convert a list of words to a list of integers in scikit-learn


I want to convert a list of words to a list of integers in scikit-learn, for a corpus that consists of a list of lists of words. For example, the corpus can be a bunch of sentences.

I can do this as follows using sklearn.feature_extraction.text.CountVectorizer, but is there a simpler way? I suspect I may be missing some CountVectorizer functionality, since this is a common pre-processing step in natural language processing. In this code I first fit CountVectorizer, then I have to iterate over each word of each list of words to generate the list of integers.

    import numpy as np
    import sklearn.feature_extraction.text


    def reverse_dictionary(mapping):
        '''
        http://stackoverflow.com/questions/483666/python-reverse-inverse-a-mapping
        '''
        return {v: k for k, v in mapping.items()}


    vectorizer = sklearn.feature_extraction.text.CountVectorizer(min_df=1)

    corpus = ['This is the first document.',
              'This is the second second document.',
              'And the third one.',
              'Is this the first document? This is right.']

    # Fit the vectorizer; this populates vectorizer.vocabulary_
    X = vectorizer.fit_transform(corpus).toarray()

    tokenizer = vectorizer.build_tokenizer()
    output_corpus = []
    for line in corpus:
        # The tokenizer does not lowercase, so do it explicitly
        line = tokenizer(line.lower())
        # np.int was removed in NumPy 1.24; the built-in int is equivalent
        output_line = np.empty_like(line, dtype=int)
        for token_number, token in np.ndenumerate(line):
            output_line[token_number] = vectorizer.vocabulary_.get(token)

        output_corpus.append(output_line)
    print('output_corpus: {0}'.format(output_corpus))

    word2idx = vectorizer.vocabulary_
    print('word2idx: {0}'.format(word2idx))

    idx2word = reverse_dictionary(word2idx)
    print('idx2word: {0}'.format(idx2word))

Outputs:

    output_corpus: [array([9, 3, 7, 2, 1]),           # 'This is the first document.'
                    array([9, 3, 7, 6, 6, 1]),        # 'This is the second second document.'
                    array([0, 7, 8, 4]),              # 'And the third one.'
                    array([3, 9, 7, 2, 1, 9, 3, 5])]  # 'Is this the first document? This is right.'
    word2idx: {u'and': 0, u'right': 5, u'third': 8, u'this': 9, u'is': 3, u'one': 4,
               u'second': 6, u'the': 7, u'document': 1, u'first': 2}
    idx2word: {0: u'and', 1: u'document', 2: u'first', 3: u'is', 4: u'one', 5: u'right',
               6: u'second', 7: u'the', 8: u'third', 9: u'this'}

I don't know if there is a more direct way, but you can simplify the syntax by using map instead of a for-loop to iterate over each word.

You can also use build_analyzer(), which handles both preprocessing and tokenization, so there is no need to call lower() explicitly.

    analyzer = vectorizer.build_analyzer()
    output_corpus = [map(lambda x: vectorizer.vocabulary_.get(x), analyzer(line)) for line in corpus]
    # For Python 3.x it should be
    # [list(map(lambda x: vectorizer.vocabulary_.get(x), analyzer(line))) for line in corpus]

output_corpus:

    [[9, 3, 7, 2, 1], [9, 3, 7, 6, 6, 1], [0, 7, 8, 4], [3, 9, 7, 2, 1, 9, 3, 5]]
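To make concrete what build_analyzer() bundles together, here is a rough standard-library approximation of the default analyzer. It is only a sketch: it assumes CountVectorizer's default settings (lowercasing, the token pattern `(?u)\b\w\w+\b`, unigrams, no stop words), and `simple_analyzer` is a hypothetical name, not part of scikit-learn.

```python
import re

# Assumed to match CountVectorizer's default token_pattern:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def simple_analyzer(line):
    """Hypothetical stand-in for vectorizer.build_analyzer() with default settings."""
    # Preprocessing (lowercasing) followed by tokenization (regex matching)
    return TOKEN_PATTERN.findall(line.lower())


print(simple_analyzer('Is this the first document? This is right.'))
# ['is', 'this', 'the', 'first', 'document', 'this', 'is', 'right']
```

Note that under this pattern single-character tokens such as "a" are dropped, and punctuation never becomes a token.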

Edit

Thanks to @user3914041, just using a list comprehension might be preferable in this case. It avoids the lambda and thus can be a little faster than map. (According to "Python List Comprehension Vs. Map" and my own simple tests.)

    output_corpus = [[vectorizer.vocabulary_.get(x) for x in analyzer(line)] for line in corpus]
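If the corpus is already a list of lists of words, as described in the question, the same integer encoding can be sketched without scikit-learn. This is an illustration only, not the method from the answer above: `build_vocabulary`, `encode`, and the `unknown` sentinel are hypothetical names, and the alphabetical index assignment is chosen to mirror the ordering seen in the word2idx output of the question.

```python
def build_vocabulary(tokenized_corpus):
    """Map each distinct word to an index, assigned in alphabetical order."""
    words = sorted({word for sentence in tokenized_corpus for word in sentence})
    return {word: index for index, word in enumerate(words)}


def encode(tokenized_corpus, word2idx, unknown=-1):
    """Replace each word by its index; unseen words get the `unknown` sentinel."""
    return [[word2idx.get(word, unknown) for word in sentence]
            for sentence in tokenized_corpus]


tokenized_corpus = [
    ['this', 'is', 'the', 'first', 'document'],
    ['and', 'the', 'third', 'one'],
]
word2idx = build_vocabulary(tokenized_corpus)

print(encode(tokenized_corpus, word2idx))
# [[7, 3, 5, 2, 1], [0, 5, 6, 4]]

# 'fourth' was never seen while building the vocabulary
print(encode([['the', 'fourth', 'document']], word2idx))
# [[5, -1, 1]]
```

A plain `.get(word)` with no default returns None for words that are not in the vocabulary, so passing an explicit sentinel keeps every output list made of integers only.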
