python - Convert a list of words to a list of integers in scikit-learn
I want to convert a list of words to a list of integers in scikit-learn, and do so for a corpus that consists of a list of lists of words. E.g. the corpus can be a bunch of sentences.
I can do it as follows using sklearn.feature_extraction.text.CountVectorizer, but is there a simpler way? I suspect I may be missing some CountVectorizer functionality, as this is a common pre-processing step in natural language processing. In this code I first fit CountVectorizer, then I have to iterate over each word of each list of words to generate the list of integers.
import sklearn
import sklearn.feature_extraction.text
import numpy as np

def reverse_dictionary(mapping):
    '''
    http://stackoverflow.com/questions/483666/python-reverse-inverse-a-mapping
    '''
    return {v: k for k, v in mapping.items()}

vectorizer = sklearn.feature_extraction.text.CountVectorizer(min_df=1)

corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document? This is right.',]

# Fit the vocabulary (and build the document-term count matrix)
X = vectorizer.fit_transform(corpus).toarray()

# Convert each sentence into an array of vocabulary indices
tokenizer = vectorizer.build_tokenizer()
output_corpus = []
for line in corpus:
    line = tokenizer(line.lower())
    # np.int was removed from recent NumPy releases; the builtin int works
    output_line = np.empty_like(line, dtype=int)
    for token_number, token in np.ndenumerate(line):
        output_line[token_number] = vectorizer.vocabulary_.get(token)
    output_corpus.append(output_line)
print('output_corpus: {0}'.format(output_corpus))

word2idx = vectorizer.vocabulary_
print('word2idx: {0}'.format(word2idx))

idx2word = reverse_dictionary(word2idx)
print('idx2word: {0}'.format(idx2word))
outputs:
output_corpus: [array([9, 3, 7, 2, 1]),           # 'This is the first document.'
                array([9, 3, 7, 6, 6, 1]),        # 'This is the second second document.'
                array([0, 7, 8, 4]),              # 'And the third one.'
                array([3, 9, 7, 2, 1, 9, 3, 5])]  # 'Is this the first document? This is right.'
word2idx: {u'and': 0, u'right': 5, u'third': 8, u'this': 9, u'is': 3, u'one': 4, u'second': 6, u'the': 7, u'document': 1, u'first': 2}
idx2word: {0: u'and', 1: u'document', 2: u'first', 3: u'is', 4: u'one', 5: u'right', 6: u'second', 7: u'the', 8: u'third', 9: u'this'}
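As a quick sanity check on the output above, the idx2word mapping can be used to turn a list of integers back into words. This is only a minimal sketch: the mapping is copied by hand from the printed output, and decode is a hypothetical helper name, not part of scikit-learn.

```python
# Mapping copied from the idx2word output above.
idx2word = {0: 'and', 1: 'document', 2: 'first', 3: 'is', 4: 'one',
            5: 'right', 6: 'second', 7: 'the', 8: 'third', 9: 'this'}


def decode(indices):
    """Hypothetical helper: map a sequence of integer indices back to words."""
    return [idx2word[int(i)] for i in indices]


decoded = decode([9, 3, 7, 2, 1])
print(decoded)  # ['this', 'is', 'the', 'first', 'document']
```

Note that the round trip loses case and punctuation, since both are discarded before the words are looked up.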
I don't know if there is a more direct way, but you can simplify the syntax by using map instead of a for-loop to iterate over each word.
You can also use build_analyzer(), which handles both preprocessing and tokenization, so there is no need to call lower() explicitly.
analyzer = vectorizer.build_analyzer()
output_corpus = [map(lambda x: vectorizer.vocabulary_.get(x), analyzer(line)) for line in corpus]
# For Python 3.x it should be
# [list(map(lambda x: vectorizer.vocabulary_.get(x), analyzer(line))) for line in corpus]
output_corpus:
[[9, 3, 7, 2, 1], [9, 3, 7, 6, 6, 1], [0, 7, 8, 4], [3, 9, 7, 2, 1, 9, 3, 5]]
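To illustrate why lower() becomes unnecessary, here is a minimal sketch comparing the two callables on a single sentence. It assumes a CountVectorizer with default settings (lowercase=True and the default token pattern); the variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()  # defaults assumed: lowercase=True

sentence = 'This is the first document.'

# build_tokenizer() only splits the string into tokens; case is preserved.
tokenizer = vectorizer.build_tokenizer()
tokens = tokenizer(sentence)

# build_analyzer() applies preprocessing (lowercasing here), then tokenizes.
analyzer = vectorizer.build_analyzer()
analyzed = analyzer(sentence)

print(tokens)    # ['This', 'is', 'the', 'first', 'document']
print(analyzed)  # ['this', 'is', 'the', 'first', 'document']
```

Neither call requires the vectorizer to be fitted, since both depend only on its configuration.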
Edit
Thanks to @user3914041, just using a list comprehension might be preferable in this case. It avoids lambda and can be a bit faster than map. (According to "Python List Comprehension Vs. Map" and some simple tests.)
output_corpus = [[vectorizer.vocabulary_.get(x) for x in analyzer(line)] for line in corpus]
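Putting the pieces together, the following is a self-contained sketch of the list-comprehension approach. One caveat it tries to address: vocabulary_.get() returns None for any token that was not seen during fitting, so this version skips such tokens instead. The encode helper is hypothetical (not part of scikit-learn), and dropping unknown tokens is just one possible policy.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document? This is right.',
]

vectorizer = CountVectorizer()
vectorizer.fit(corpus)
analyzer = vectorizer.build_analyzer()
word2idx = vectorizer.vocabulary_


def encode(line):
    """Hypothetical helper: map each known token of `line` to its index."""
    # Tokens absent from the fitted vocabulary are skipped rather than
    # mapped to None; int() normalizes NumPy integer values to plain ints.
    return [int(word2idx[token]) for token in analyzer(line) if token in word2idx]


output_corpus = [encode(line) for line in corpus]
print(output_corpus)

# 'an' and 'unseen' never appeared in the corpus, so they are dropped.
unseen_example = encode('This is an unseen document.')
print(unseen_example)
```

If unknown words should be kept, an alternative is to reserve an extra index for them, at the cost of no longer matching the vectorizer's own feature numbering.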