python - ValueError: empty vocabulary; perhaps the documents only contain stop words


I'm using the scikit-learn library for the first time, and I get this error:

  File "c:\users\a605563\desktop\velibprojetpreso\traitementtwitterdico.py", line 33, in <module>
    x_train_counts = count_vect.fit_transform(filetweets)
  File "c:\python27\lib\site-packages\sklearn\feature_extraction\text.py", line 804, in fit_transform
    self.fixed_vocabulary_)
  File "c:\python27\lib\site-packages\sklearn\feature_extraction\text.py", line 751, in _count_vocab
    raise ValueError("empty vocabulary; perhaps the documents only contain stop words")
ValueError: empty vocabulary; perhaps the documents only contain stop words

But I don't understand why this is happening.

import sklearn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy
import unicodedata
import nltk

tweetsfile = open('tweets2015-08-13.csv', 'r+')
f2 = open('analyzer.txt', 'a')
print tweetsfile.readline()

count_vect = CountVectorizer(strip_accents='ascii')
filetweets = tweetsfile.read()
filetweets = filetweets.decode('latin1')
filetweets = unicodedata.normalize('NFKD', filetweets).encode('ascii', 'ignore')
print filetweets

for line in tweetsfile:
    f2.write(line.replace('\n', ' '))
tweetsfile = f2

print type(filetweets)
x_train_counts = count_vect.fit_transform(filetweets)  # this line raises the ValueError
print x_train_counts.shape
tweetsfile.close()
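For what it's worth, here is a stripped-down sketch (my own minimal version, same file as above) that reproduces the same error:

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(strip_accents='ascii')
filetweets = open('tweets2015-08-13.csv').read()  # one big string
x_train_counts = count_vect.fit_transform(filetweets)  # raises the ValueError above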

My data is raw tweets:

11/8/2015 @ paris marriott champs elysees hotel "
2015-08-11 21:27:15,"i'm @ paris marriott hotel champs-elysees in paris, fr <https://t.co/gafspvw6fc>"
2015-08-11 21:24:08,"i'm @ 4 seasons hotel george v in paris, ile-de-france <https://t.co/dtpalvziwy>"
2015-08-11 21:22:11,    . @ avenue des champs-elysees <https://t.co/8b7u05oaxg>
2015-08-11 20:54:18,her pistol go @ raspoutine paris (official) <https://t.co/le9l3dtdgm>
2015-08-11 20:50:14,"desde paris, con amor. @ avenue des champs-elysees <https://t.co/r68jv3nt1z>"

Does anyone know what's happening here?

I found a solution; here is the code:

import sklearn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np
import unicodedata
import nltk
import StringIO

tweetsfile = open('tweets2015-08-13.csv', 'r+')
yourresult = [line.split(',') for line in tweetsfile.readlines()]

count_vect = CountVectorizer(input="file")
docs_new = [StringIO.StringIO(x) for x in yourresult]  # one file-like object per line
x_train_counts = count_vect.fit_transform(docs_new)
vocab = count_vect.get_feature_names()
print x_train_counts.shape
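For anyone who hits this later: as far as I can tell, the root cause is that fit_transform expects an iterable of documents. Passing one big string makes CountVectorizer iterate it character by character, and since the default token pattern only keeps tokens of two or more word characters, every "document" produces no tokens and the vocabulary comes out empty. The StringIO version works because, with input="file", CountVectorizer calls .read() on each element, so each file-like object becomes one document.

A simpler variant of the same fix, as a minimal sketch that skips StringIO and just passes a plain list of strings (one tweet per line, same file name as above):

from sklearn.feature_extraction.text import CountVectorizer

# fit_transform wants an iterable of documents, not one big string,
# so build a list with one string per tweet line
tweetsfile = open('tweets2015-08-13.csv')
docs_new = [line.strip() for line in tweetsfile if line.strip()]
tweetsfile.close()

count_vect = CountVectorizer(strip_accents='ascii')
x_train_counts = count_vect.fit_transform(docs_new)
print x_train_counts.shape
print count_vect.get_feature_names()[:10]  # a peek at the learned vocabulary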
