TF-IDF + LDA Text Classification

TF-IDF

To count word frequencies, first split each already-segmented string into a list of words:

# Split each whitespace-separated (pre-segmented) line into a list of tokens
X_train = [line.split() for line in X_train]
X_test = [line.split() for line in X_test]
from gensim import corpora, models

def tfidf(X_train):
    # print(X_train[:3])
    dict = corpora.Dictionary(X_train)  # note: shadows the built-in dict; name kept to match the rest of the post
    # dict.save('dict1.dict')
    corpus = [dict.doc2bow(text) for text in X_train]
    # print(corpus[:3])
    tfidf_model = models.TfidfModel(corpus)  # num_docs=100805, num_nnz=1761406
    # test = ['反映', 'xx', ... , 'xx', '问题']
    # print(dict.doc2bow(test))
    # print(tfidf_model[dict.doc2bow(test)])
    corpus_tfidf = tfidf_model[corpus]
    # corpus_tfidf.save('corpus_tfidf.corp')
    return corpus_tfidf, tfidf_model, dict  # nested list: one (id, weight) list per document
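A minimal usage sketch for the function above (assuming X_train is the nested token list built earlier):

corpus_tfidf, tfidf_model, dict = tfidf(X_train)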

corpus represents each document as a list of (id, term frequency) pairs.

corpus_tfidf represents each document as a list of (id, tfidf weight) pairs.

Applying tfidf_model to any bag-of-words corpus produced by dict.doc2bow converts it into a tfidf corpus; the line corpus_tfidf = tfidf_model[corpus] above simply transforms the training corpus itself.
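For example, an unseen, already-segmented document can be pushed through the same objects; the token list below is a made-up placeholder, mirroring the commented-out test in the function above:

new_doc = ['反映', 'xx', 'xx', '问题']   # hypothetical segmented document
new_bow = dict.doc2bow(new_doc)          # [(word id, raw count), ...]
new_tfidf = tfidf_model[new_bow]         # [(word id, tfidf weight), ...]
print(new_bow)
print(new_tfidf)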

LDA

In text classification, LDA is essentially a dimensionality-reduction step: with num_topics topics, each sentence is mapped to a vector of that many dimensions, one per topic, representing how likely the sentence belongs to each topic; this embedding is then fed to a classifier.

# Train a 100-topic LDA model on the tfidf-weighted corpus, using multiple cores
lda = models.ldamulticore.LdaMulticore(corpus=corpus_tfidf, id2word=dict, num_topics=100)
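To sanity-check what the model has learned, the top words of a few topics can be printed; a quick inspection sketch (not part of the original pipeline), using gensim's print_topics:

for topic_id, topic_words in lda.print_topics(num_topics=5, num_words=8):
    print(topic_id, topic_words)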

Afterwards, lda.inference([tfidf_model[dict.doc2bow(i)]])[0][0] yields such an embedding for a new token list:

import numpy as np

train, test = [], []
for i in X_train:
    train.append(lda.inference([tfidf_model[dict.doc2bow(i)]])[0][0])
for i in X_test:
    test.append(lda.inference([tfidf_model[dict.doc2bow(i)]])[0][0])

train, test = np.asarray(train), np.asarray(test)
np.save('train.npy', train)
np.save('test.npy', test)
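As a side note, lda.inference also accepts a whole chunk of documents at once, so the per-document loops above can be replaced by batched calls; a sketch under the same variable names:

# Batched variant: inference returns (gamma, sstats); gamma has shape (num_docs, num_topics)
train = lda.inference([tfidf_model[dict.doc2bow(i)] for i in X_train])[0]
test = lda.inference([tfidf_model[dict.doc2bow(i)] for i in X_test])[0]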

Classification

from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Load the LDA embeddings saved above and train an RBF-kernel SVM on them
train, test = np.load('train.npy').tolist(), np.load('test.npy').tolist()

print("------Training svm classifier------")
sv = SVC(C=1, kernel='rbf')
sv.fit(train, y_train)
pred_y = sv.predict(test)
print(classification_report(y_test, pred_y))