python - ValueError: empty vocabulary; perhaps the documents only contain stop words
I'm using the scikit-learn library for the first time, and I got this error:
    File "c:\users\a605563\desktop\velibprojetpreso\traitementtwitterdico.py", line 33, in <module>
      x_train_counts = count_vect.fit_transform(filetweets)
    File "c:\python27\lib\site-packages\sklearn\feature_extraction\text.py", line 804, in fit_transform
      self.fixed_vocabulary_)
    File "c:\python27\lib\site-packages\sklearn\feature_extraction\text.py", line 751, in _count_vocab
      raise ValueError("empty vocabulary; perhaps the documents only contain stop words")
    ValueError: empty vocabulary; perhaps the documents only contain stop words
But I don't understand why that's happening.
    import sklearn
    from sklearn.feature_extraction.text import CountVectorizer
    import pandas as pd
    import numpy
    import unicodedata
    import nltk

    tweetsfile = open('tweets2015-08-13.csv', 'r+')
    f2 = open('analyzer.txt', 'a')
    print tweetsfile.readline()

    count_vect = CountVectorizer(strip_accents='ascii')

    # read the whole file into one string and force it to plain ASCII
    filetweets = tweetsfile.read()
    filetweets = filetweets.decode('latin1')
    filetweets = unicodedata.normalize('NFKD', filetweets).encode('ascii', 'ignore')
    print filetweets

    for line in tweetsfile:
        f2.write(line.replace('\n', ' '))
    tweetsfile = f2

    print type(filetweets)
    x_train_counts = count_vect.fit_transform(filetweets)
    print x_train_counts.shape
    tweetsfile.close()
My data is raw tweets:
    11/8/2015 @ paris marriott champs elysees hotel "
    2015-08-11 21:27:15,"i'm @ paris marriott hotel champs-elysees in paris, fr <https://t.co/gafspvw6fc>"
    2015-08-11 21:24:08,"i'm @ 4 seasons hotel george v in paris, ile-de-france <https://t.co/dtpalvziwy>"
    2015-08-11 21:22:11, . @ avenue des champs-elysees <https://t.co/8b7u05oaxg>
    2015-08-11 20:54:18,her pistol go @ raspoutine paris (official) <https://t.co/le9l3dtdgm>
    2015-08-11 20:50:14,"desde paris, con amor. @ avenue des champs-elysees <https://t.co/r68jv3nt1z>"
Does anyone know what's happening here?
I found the solution; here is the code:
    import sklearn
    from sklearn.feature_extraction.text import CountVectorizer
    import pandas as pd
    import numpy as np
    import unicodedata
    import nltk
    import StringIO

    tweetsfile = open('tweets2015-08-13.csv', 'r+')
    # split every CSV line on commas
    yourresult = [line.split(',') for line in tweetsfile.readlines()]

    # input="file" makes the vectorizer read each document from a file-like object
    count_vect = CountVectorizer(input="file")
    docs_new = [StringIO.StringIO(x) for x in yourresult]

    x_train_counts = count_vect.fit_transform(docs_new)
    vocab = count_vect.get_feature_names()
    print x_train_counts.shape
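For what it's worth, the error in the original code most likely comes from passing a single string to fit_transform: CountVectorizer expects an iterable of documents, so a bare string gets iterated character by character, and the default token pattern only keeps tokens of two or more word characters, leaving the vocabulary empty. A minimal sketch reproducing this (the example strings are made up):

    from sklearn.feature_extraction.text import CountVectorizer

    count_vect = CountVectorizer(strip_accents='ascii')

    # One big string: each character becomes its own "document", none of
    # them survive tokenization, so fit_transform raises the ValueError.
    try:
        count_vect.fit_transform("tweets all glued into one string")
    except ValueError as e:
        print e  # empty vocabulary; perhaps the documents only contain stop words

    # A list of strings: one document per element, works as expected.
    x = count_vect.fit_transform(["first tweet", "second tweet"])
    print x.shape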
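Note that line.split(',') also splits inside the quoted tweet bodies, which themselves contain commas, and the StringIO wrapping is only needed because of input="file". A simpler sketch, assuming the goal is one document per CSV line with the tweet text in the second column (the file name and column layout are taken from the question):

    import csv
    from sklearn.feature_extraction.text import CountVectorizer

    with open('tweets2015-08-13.csv', 'rb') as f:
        # csv.reader handles the quoted, comma-containing tweet text;
        # keep the second column (the tweet body) where it exists
        docs = [row[1] for row in csv.reader(f) if len(row) >= 2]

    count_vect = CountVectorizer(strip_accents='ascii')
    x_train_counts = count_vect.fit_transform(docs)
    print x_train_counts.shape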