NLP: Target and aspect detection with Python.

Publicado por

In this post we perform target and aspect detection on a dataset about tripadvisor opinions.

Target or topic are the words or topics the opinions are about. Aspects are parts or features of the target.

Here we explore the target detection using word embeddings (Word2Vec) which extracts similar words by context and try to extract aspects of the target by searching close words wusing the WordNet synsets.

First, we perform data preprocessing by removing stopwords and punctuation marks, convert to lowercase and lemmatize (get the words base). Then we perform target and aspect detection. Finally we perform some clustering in order to group the aspects found.

Aspect and sentiment detection
Data loading
In [1]:
import pandas as pd
import gensim
import nltk
from nltk import word_tokenize
from nltk.collocations import *
from nltk.stem.wordnet import WordNetLemmatizer
import re

tripadvisor_data = pd.read_csv('tripadvisor_data.csv')
#tripadvisor_data

Let's preprocess data. Data has a short field (headline) and a long description.

Let's remove stopwords (using NLTK libraries) and punctuation marks. We also convert all the words to lowercase and lemmatize words (get the words base) with WordNetLemmatizer

In [2]:

tripadvisor_data.head()
Out[2]:
Short Long Class Opinion
0 “Very nice atmosphere” We were together with some friends at the Anew... family POS
1 “Very nice food, great atmosphere, feels like ... Martin and his staff are truely great! They ma... family POS
2 “Best Hotel on the Planet” We have stayed at the Excelsior on numerous oc... family POS
3 “What a vacation should be” Having four days free in Milan, we decided to ... friends POS
4 “Excellent stay” In all aspects an excellent stay. Professional... couple POS
In [3]:
#Remove punctuation marks and convert to lowercase

short=tripadvisor_data['Short'].str.strip('".,;:-():!?-‘’“” ').str.lower()
long=tripadvisor_data['Long'].str.strip('"./,;:-():!?-‘’“” ').str.lower()



#Remove stopwords
stopwords = []

#Import NLTK list of stopwords. 
stopwords = nltk.corpus.stopwords.words('english')

short = short.apply(lambda x: ' '.join([word for word in x.split() if word not in (stopwords)]))
long = long.apply(lambda x: ' '.join([word for word in x.split() if word not in (stopwords)]))


short
Out[3]:
0                                    nice atmosphere
1       nice food, great atmosphere, feels like home
2                                  best hotel planet
3                                           vacation
4                                     excellent stay
                            ...                     
1456                      comfortable delicious food
1457                   meraviglioso .... come sempre
1458                                 wonderful hotel
1459                   warm mountain lodge dolomites
1460           nice location overpriced poor service
Name: Short, Length: 1461, dtype: object

Let's proceed with lemmatization for short and long texts

In [4]:
#Import NLTK methods to lemmatize with Wordnet
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for s in short:
    s = ' '.join([lemmatizer.lemmatize(w) for w in s])

for l in long:
    l =  ' '.join([lemmatizer.lemmatize(w) for w in l])
print(short[0:5])
print(long[0:5])
0                                 nice atmosphere
1    nice food, great atmosphere, feels like home
2                               best hotel planet
3                                        vacation
4                                  excellent stay
Name: Short, dtype: object
0    together friends anewandter hof. took 4 apartm...
1    martin staff truely great! make feel home. che...
2    stayed excelsior numerous occasions, summer wi...
3    four days free milan, decided join friends alr...
4    aspects excellent stay. professional friendly ...
Name: Long, dtype: object
In [5]:
#Join short and long texts
texto=short.append(long)
texto[0:5]
Out[5]:
0                                 nice atmosphere
1    nice food, great atmosphere, feels like home
2                               best hotel planet
3                                        vacation
4                                  excellent stay
dtype: object

Load the Word2Vec model.

In [6]:
from gensim.models import Word2Vec,  KeyedVectors

model = KeyedVectors.load_word2vec_format('model_w2v')
Let's extract 10 target candidates by searching words similar to hotel, according to gensim.
In [7]:


#Let's calculate the similarity of every word with hotel
words=[]
wordsim=[]
similarity=[]
for t in texto:
    for w in word_tokenize(t):
        if(w in model.wv.vocab):
            #Evitamos repetir palabras
            if(w not in words):
                words.append(w)
                wordsim.append(w)
                wordsim.append(model.similarity(w,'hotel'))
                similarity.append(wordsim)
                wordsim=[]

#Sort and get the 10 most similar words

similarity.sort(key = lambda x: x[1],reverse=True) 
print(similarity[0:10])
    
[['hotel', 1.0000001], ['hotels', 0.8347789], ['restaurant', 0.8012507], ['inn', 0.7934126], ['apartment', 0.7811176], ['mansion', 0.7551883], ['hilton', 0.74470633], ['resort', 0.7422404], ['shopping', 0.7336252], ['rented', 0.7322151]]
Let's play a little bit and adjust the results by adding negative words to the function most_similar (substracts those vectors from the word, to get similar words).
In [8]:
#Let's add some restoration and shopping words as negative 
most_similars = model.wv.most_similar(positive=['hotel','inn','apartment','resort'],negative=['restaurant','cafe','shopping'])
most_similars
Out[8]:
[('mansion', 0.6289756298065186),
 ('oceanfront', 0.5899235010147095),
 ('lodge', 0.589352011680603),
 ('coronado', 0.5851401090621948),
 ('residence', 0.579605221748352),
 ('pines', 0.5751224756240845),
 ('balmoral', 0.5652951598167419),
 ('villas', 0.5619324445724487),
 ('condominium', 0.5599920749664307),
 ('heights', 0.5514863729476929)]
In [9]:
#Now let's try with some location related words as negative
most_similars = model.wv.most_similar(positive=['hotel','inn','apartment','resort'],negative=['downtown','oceanfront'])
most_similars
Out[9]:
[('restaurant', 0.7719864249229431),
 ('motel', 0.7122789621353149),
 ('hotels', 0.710576057434082),
 ('lodge', 0.7067229747772217),
 ('cafe', 0.6683881878852844),
 ('rented', 0.6660130620002747),
 ('luxury', 0.657811164855957),
 ('hilton', 0.6566590070724487),
 ('posh', 0.6548078656196594),
 ('residence', 0.6500182747840881)]
Let's calculate 50 aspect candidates from WordNet synsets (similar words). We are going to search the 50 closest sysnsets to 'hotel' according to Wu and Palmer distance. We are going to use the one word nouns and adjective-noun bigrams.
In [10]:
from nltk.util import ngrams

#Search for noun and adjective-noun.
def is_np(candidate):
    test = False
    tokens = candidate
    tagged_tokens = nltk.pos_tag(tokens)
    if len(tagged_tokens) > 1:
        PoS_initial = tagged_tokens[0][1][:2]
        PoS_final = tagged_tokens[-1][1][:2]
        if ((PoS_initial == 'JJ') and (PoS_final == 'NN' or PoS_final == 'NS')):
            test = True
    else: #If it's not greater than 1, then the length is 1.
         PoS_initial = tagged_tokens[0][1][:2]
         if (PoS_initial == 'NN' or PoS_initial == 'NS'):
             test=True
    return test

#Filter nouns and adjective-noun bigrams
bigrams=[]
aspect_candidates=[]
#Lowercase token list
for t in texto:
    tokens = [w for w in word_tokenize(t)]
    #Buscamos candidatos
    for t in tokens:
        if(is_np(t)):
            aspect_candidates.append(t)
    #Buscamos bigramas 
    bigrams=(list(ngrams(tokens, 2)))
    for b in bigrams:
        if(is_np(b)):
            aspect_candidates.append(b[0]+'_'+b[1])
In [11]:
aspect_candidates[1:5]
Out[11]:
['feels', 'nice_food', 'great_atmosphere', 'best_hotel']
In [12]:
#Let's choose the 50 candidates.
from nltk.corpus import wordnet as wn


def get_aspects(syn):
    target=wn.synset(syn)
    simvector=[]
    cands=[]
    for ac in aspect_candidates:
        candidates_synsets=wn.synsets(lemmatizer.lemmatize(ac),pos=wn.NOUN)
    
        for cs in candidates_synsets:
            if "'" + ac.replace(' ','_') + ".n" in str(cs):            
                wup=target.wup_similarity(wn.synset(lemmatizer.lemmatize(ac)+'.n.01'))
                simvector.append(lemmatizer.lemmatize(ac))
                simvector.append(wup)
                if(not(simvector in cands)):
                    cands.append(simvector)
                simvector=[]
            

    cands.sort(key = lambda x: x[1],reverse=True)
    return cands
In [13]:
candidates=get_aspects('hotel.n.01')
In [14]:
#Show the 50 first candidates and their distance.
print(candidates[0:50])
[['fountain', 0.8], ['office', 0.7058823529411765], ['clean_room', 0.7058823529411765], ['nunnery', 0.7], ['felt', 0.6666666666666666], ['common_room', 0.6666666666666666], ['turkish_bath', 0.6666666666666666], ['suiting', 0.6666666666666666], ['shower_stall', 0.631578947368421], ['floor', 0.625], ['public_transport', 0.625], ['slide', 0.625], ['stick', 0.625], ['overload', 0.625], ['still', 0.5882352941176471], ['ski', 0.5882352941176471], ['scenery', 0.5882352941176471], ['overall', 0.5882352941176471], ['ceiling', 0.5882352941176471], ['ski_boot', 0.5882352941176471], ['nest', 0.5714285714285714], ['hot_tub', 0.5555555555555556], ['electric_heater', 0.5555555555555556], ['underpass', 0.5555555555555556], ['main_street', 0.5555555555555556], ['switch', 0.5555555555555556], ['chimney', 0.5555555555555556], ['newspaper', 0.5555555555555556], ['net', 0.5555555555555556], ['chime', 0.5555555555555556], ['stile', 0.5263157894736842], ['footage', 0.5263157894736842], ['front_door', 0.5263157894736842], ['main_drag', 0.5263157894736842], ['single_bed', 0.5263157894736842], ['high_table', 0.5263157894736842], ['snow_tire', 0.5263157894736842], ['folder', 0.5263157894736842], ['salad_bar', 0.5], ['upper', 0.5], ['friend', 0.5], ['undies', 0.5], ['user', 0.5], ['neutral', 0.5], ['oven', 0.5], ['unfortunate', 0.5], ['female', 0.5], ['ski_jump', 0.47619047619047616], ['fellow', 0.47058823529411764], ['child', 0.47058823529411764]]
Let's repeat the aspect selection with other words such as mansion and oceanfront.
In [15]:

#Repeat aspect selection with the word mansion.
candidates2=get_aspects('mansion.n.01')
print(candidates2[0:50])
[['outside', 0.8333333333333334], ['south_side', 0.7692307692307693], ['outdoors', 0.7692307692307693], ['north_side', 0.7692307692307693], ['opening', 0.6666666666666666], ['old_world', 0.6666666666666666], ['front', 0.625], ['open', 0.6153846153846154], ['overlook', 0.6153846153846154], ['northern_europe', 0.6153846153846154], ['new_england', 0.6153846153846154], ['west_coast', 0.6153846153846154], ['outpost', 0.5714285714285714], ['picnic_area', 0.5714285714285714], ['neighborhood', 0.5714285714285714], ['industrial_park', 0.5714285714285714], ['nook', 0.5714285714285714], ['upstairs', 0.5454545454545454], ['iceberg', 0.5454545454545454], ['san_francisco', 0.5333333333333333], ['new_york', 0.5333333333333333], ['switzerland', 0.5], ['flower_garden', 0.5], ['florence', 0.5], ['climb', 0.5], ['united_states', 0.5], ['nest', 0.5], ['green_mountains', 0.5], ['upgrade', 0.5], ['nepal', 0.5], ['trentino-alto_adige', 0.5], ['felt', 0.46153846153846156], ['friend', 0.46153846153846156], ['user', 0.46153846153846156], ['neutral', 0.46153846153846156], ['fountain', 0.46153846153846156], ['unfortunate', 0.46153846153846156], ['uphill', 0.46153846153846156], ['suiting', 0.46153846153846156], ['fellow', 0.42857142857142855], ['floor', 0.42857142857142855], ['public_transport', 0.42857142857142855], ['child', 0.42857142857142855], ['slide', 0.42857142857142855], ['stick', 0.42857142857142855], ['overload', 0.42857142857142855], ['nudist', 0.42857142857142855], ['female', 0.42857142857142855], ['novice', 0.42857142857142855], ['still', 0.4]]
In [16]:

#Let's repeat with the word oceanfront
candidates3=get_aspects('oceanfront.n.01')
print(candidates3[0:50])
[['climb', 0.7272727272727273], ['green_mountains', 0.7272727272727273], ['upgrade', 0.7272727272727273], ['iceberg', 0.7272727272727273], ['uphill', 0.6666666666666666], ['upstairs', 0.6], ['outside', 0.5454545454545454], ['opening', 0.5454545454545454], ['nest', 0.5454545454545454], ['old_world', 0.5454545454545454], ['felt', 0.5], ['open', 0.5], ['south_side', 0.5], ['friend', 0.5], ['user', 0.5], ['neutral', 0.5], ['outdoors', 0.5], ['fountain', 0.5], ['unfortunate', 0.5], ['overlook', 0.5], ['northern_europe', 0.5], ['north_side', 0.5], ['new_england', 0.5], ['suiting', 0.5], ['west_coast', 0.5], ['outpost', 0.46153846153846156], ['picnic_area', 0.46153846153846156], ['fellow', 0.46153846153846156], ['floor', 0.46153846153846156], ['public_transport', 0.46153846153846156], ['child', 0.46153846153846156], ['neighborhood', 0.46153846153846156], ['slide', 0.46153846153846156], ['stick', 0.46153846153846156], ['overload', 0.46153846153846156], ['nudist', 0.46153846153846156], ['female', 0.46153846153846156], ['industrial_park', 0.46153846153846156], ['novice', 0.46153846153846156], ['nook', 0.46153846153846156], ['still', 0.42857142857142855], ['ski', 0.42857142857142855], ['scenery', 0.42857142857142855], ['overall', 0.42857142857142855], ['office', 0.42857142857142855], ['ceiling', 0.42857142857142855], ['skier', 0.42857142857142855], ['fuchs', 0.42857142857142855], ['chicory', 0.42857142857142855], ['clean_room', 0.42857142857142855]]
Let's make some clustering using two different criteria:
- Aspect candidates (synset Wordnet) using Wu and Palmer distance.
- Cosine similarity of the Word2Vec terms (target candidates)
In [17]:


#create distance matrix with the Wu and Palmer distance
from numpy import matrix

def create_vector(synset, synsets_vocabulary):
    vector = [synset.wup_similarity(s) for s in synsets_vocabulary] #Wu and Palmer score
    return vector

#Create a vector with the synsets
#of the 50 best aspect candidates obtained before
vcandidates=[]
for c in candidates[0:50]:
        vcandidates.append(wn.synset(c[0]+'.n.01'))
    
#Create distance vector
vectors = [create_vector(v,vcandidates) for v in vcandidates]

X = matrix(vectors)

print(X)
[[1.         0.75       0.75       ... 0.5        0.5        0.5       ]
 [0.75       1.         0.66666667 ... 0.45454545 0.44444444 0.44444444]
 [0.75       0.66666667 1.         ... 0.45454545 0.44444444 0.44444444]
 ...
 [0.5        0.45454545 0.45454545 ... 1.         0.36363636 0.36363636]
 [0.5        0.44444444 0.44444444 ... 0.36363636 1.         0.66666667]
 [0.5        0.44444444 0.44444444 ... 0.36363636 0.66666667 1.        ]]
In [18]:
from sklearn.cluster import KMeans

#Clustering.Create function

def mykmeans(X,vcandidates):
    num_clusters = [3,4,5,6,7]

    labels=[]
    clusters=[]
    nclusters=[]

    #Save labels to make graph
    for nc in num_clusters:
        km = KMeans(n_clusters=nc, n_init=10) # n_init para mantener la consistencia
        km.fit(X)  

        listlabels=km.labels_.tolist()
        labels.append(listlabels)

        #Save clusters to make list
        cluster=[]
        for ind in range(0,len(vcandidates)):
            c=[]
            c.append(listlabels[ind])
            c.append(vcandidates[ind])
            cluster.append(c)
        #Sort by cluster number
        cluster.sort(key = lambda x: x[0])
        nclusters.append(nc)
        nclusters.append(cluster)
        clusters.append(nclusters)
        nclusters=[]
    return clusters,labels

labels_color_map = {
    0: '#20b2aa', 1: '#ff7373', 2: '#ffe4e1', 3: '#005073', 4: '#4d0404', 5: '#E342D0', 6: '#35C20C', 7: '#F0FA16'
}
In [19]:
#Cluster list
clusters,labels=mykmeans(X,vcandidates)
for cl in clusters:   
    print()
    print('Número de clusters: '+str(cl[0]))
    print()
    print(cl[1])

    
Número de clusters: 3

[[0, Synset('friend.n.01')], [0, Synset('user.n.01')], [0, Synset('neutral.n.01')], [0, Synset('unfortunate.n.01')], [0, Synset('female.n.01')], [0, Synset('chap.n.01')], [0, Synset('child.n.01')], [1, Synset('public_transport.n.01')], [1, Synset('stick.n.01')], [1, Synset('ski.n.01')], [1, Synset('hot_tub.n.01')], [1, Synset('electric_heater.n.01')], [1, Synset('switch.n.01')], [1, Synset('newspaper.n.01')], [1, Synset('internet.n.01')], [1, Synset('chime.n.01')], [1, Synset('stile.n.01')], [1, Synset('footage.n.01')], [1, Synset('single_bed.n.01')], [1, Synset('high_table.n.01')], [1, Synset('salad_bar.n.01')], [1, Synset('upper_berth.n.01')], [1, Synset('ski_jump.n.01')], [2, Synset('fountain.n.01')], [2, Synset('office.n.01')], [2, Synset('clean_room.n.01')], [2, Synset('nunnery.n.01')], [2, Synset('felt.n.01')], [2, Synset('common_room.n.01')], [2, Synset('turkish_bath.n.01')], [2, Synset('suiting.n.01')], [2, Synset('shower_stall.n.01')], [2, Synset('floor.n.01')], [2, Synset('slide.n.01')], [2, Synset('overload.n.01')], [2, Synset('still.n.01')], [2, Synset('scenery.n.01')], [2, Synset('overall.n.01')], [2, Synset('ceiling.n.01')], [2, Synset('ski_boot.n.01')], [2, Synset('nest.n.01')], [2, Synset('underpass.n.01')], [2, Synset('main_street.n.01')], [2, Synset('chimney.n.01')], [2, Synset('front_door.n.01')], [2, Synset('main_drag.n.01')], [2, Synset('snow_tire.n.01')], [2, Synset('booklet.n.01')], [2, Synset('undies.n.01')], [2, Synset('oven.n.01')]]

Número de clusters: 4

[[0, Synset('nunnery.n.01')], [0, Synset('underpass.n.01')], [0, Synset('main_street.n.01')], [0, Synset('chimney.n.01')], [0, Synset('front_door.n.01')], [0, Synset('main_drag.n.01')], [0, Synset('snow_tire.n.01')], [0, Synset('booklet.n.01')], [0, Synset('undies.n.01')], [0, Synset('oven.n.01')], [1, Synset('fountain.n.01')], [1, Synset('office.n.01')], [1, Synset('clean_room.n.01')], [1, Synset('felt.n.01')], [1, Synset('common_room.n.01')], [1, Synset('turkish_bath.n.01')], [1, Synset('suiting.n.01')], [1, Synset('shower_stall.n.01')], [1, Synset('floor.n.01')], [1, Synset('slide.n.01')], [1, Synset('overload.n.01')], [1, Synset('still.n.01')], [1, Synset('scenery.n.01')], [1, Synset('overall.n.01')], [1, Synset('ceiling.n.01')], [1, Synset('ski_boot.n.01')], [1, Synset('nest.n.01')], [2, Synset('friend.n.01')], [2, Synset('user.n.01')], [2, Synset('neutral.n.01')], [2, Synset('unfortunate.n.01')], [2, Synset('female.n.01')], [2, Synset('chap.n.01')], [2, Synset('child.n.01')], [3, Synset('public_transport.n.01')], [3, Synset('stick.n.01')], [3, Synset('ski.n.01')], [3, Synset('hot_tub.n.01')], [3, Synset('electric_heater.n.01')], [3, Synset('switch.n.01')], [3, Synset('newspaper.n.01')], [3, Synset('internet.n.01')], [3, Synset('chime.n.01')], [3, Synset('stile.n.01')], [3, Synset('footage.n.01')], [3, Synset('single_bed.n.01')], [3, Synset('high_table.n.01')], [3, Synset('salad_bar.n.01')], [3, Synset('upper_berth.n.01')], [3, Synset('ski_jump.n.01')]]

Número de clusters: 5

[[0, Synset('fountain.n.01')], [0, Synset('felt.n.01')], [0, Synset('suiting.n.01')], [0, Synset('floor.n.01')], [0, Synset('public_transport.n.01')], [0, Synset('slide.n.01')], [0, Synset('stick.n.01')], [0, Synset('overload.n.01')], [0, Synset('still.n.01')], [0, Synset('scenery.n.01')], [0, Synset('overall.n.01')], [0, Synset('ceiling.n.01')], [0, Synset('ski_boot.n.01')], [0, Synset('nest.n.01')], [1, Synset('ski.n.01')], [1, Synset('hot_tub.n.01')], [1, Synset('electric_heater.n.01')], [1, Synset('switch.n.01')], [1, Synset('newspaper.n.01')], [1, Synset('internet.n.01')], [1, Synset('chime.n.01')], [1, Synset('stile.n.01')], [1, Synset('footage.n.01')], [1, Synset('single_bed.n.01')], [1, Synset('high_table.n.01')], [1, Synset('salad_bar.n.01')], [1, Synset('upper_berth.n.01')], [1, Synset('ski_jump.n.01')], [2, Synset('nunnery.n.01')], [2, Synset('underpass.n.01')], [2, Synset('main_street.n.01')], [2, Synset('chimney.n.01')], [2, Synset('front_door.n.01')], [2, Synset('main_drag.n.01')], [2, Synset('snow_tire.n.01')], [2, Synset('booklet.n.01')], [2, Synset('undies.n.01')], [2, Synset('oven.n.01')], [3, Synset('friend.n.01')], [3, Synset('user.n.01')], [3, Synset('neutral.n.01')], [3, Synset('unfortunate.n.01')], [3, Synset('female.n.01')], [3, Synset('chap.n.01')], [3, Synset('child.n.01')], [4, Synset('office.n.01')], [4, Synset('clean_room.n.01')], [4, Synset('common_room.n.01')], [4, Synset('turkish_bath.n.01')], [4, Synset('shower_stall.n.01')]]

Número de clusters: 6

[[0, Synset('underpass.n.01')], [0, Synset('main_street.n.01')], [0, Synset('chimney.n.01')], [0, Synset('front_door.n.01')], [0, Synset('main_drag.n.01')], [0, Synset('snow_tire.n.01')], [0, Synset('booklet.n.01')], [0, Synset('undies.n.01')], [0, Synset('oven.n.01')], [1, Synset('fountain.n.01')], [1, Synset('office.n.01')], [1, Synset('felt.n.01')], [1, Synset('suiting.n.01')], [1, Synset('floor.n.01')], [1, Synset('slide.n.01')], [1, Synset('overload.n.01')], [1, Synset('still.n.01')], [1, Synset('scenery.n.01')], [1, Synset('overall.n.01')], [1, Synset('ceiling.n.01')], [1, Synset('ski_boot.n.01')], [1, Synset('nest.n.01')], [2, Synset('friend.n.01')], [2, Synset('user.n.01')], [2, Synset('neutral.n.01')], [2, Synset('unfortunate.n.01')], [2, Synset('female.n.01')], [2, Synset('chap.n.01')], [2, Synset('child.n.01')], [3, Synset('public_transport.n.01')], [3, Synset('stick.n.01')], [3, Synset('ski.n.01')], [3, Synset('hot_tub.n.01')], [3, Synset('electric_heater.n.01')], [3, Synset('switch.n.01')], [3, Synset('newspaper.n.01')], [3, Synset('internet.n.01')], [3, Synset('chime.n.01')], [3, Synset('stile.n.01')], [3, Synset('footage.n.01')], [3, Synset('ski_jump.n.01')], [4, Synset('single_bed.n.01')], [4, Synset('high_table.n.01')], [4, Synset('salad_bar.n.01')], [4, Synset('upper_berth.n.01')], [5, Synset('clean_room.n.01')], [5, Synset('nunnery.n.01')], [5, Synset('common_room.n.01')], [5, Synset('turkish_bath.n.01')], [5, Synset('shower_stall.n.01')]]

Número de clusters: 7

[[0, Synset('office.n.01')], [0, Synset('clean_room.n.01')], [0, Synset('nunnery.n.01')], [0, Synset('common_room.n.01')], [0, Synset('turkish_bath.n.01')], [0, Synset('shower_stall.n.01')], [1, Synset('fountain.n.01')], [1, Synset('felt.n.01')], [1, Synset('suiting.n.01')], [1, Synset('floor.n.01')], [1, Synset('slide.n.01')], [1, Synset('overload.n.01')], [1, Synset('still.n.01')], [1, Synset('scenery.n.01')], [1, Synset('overall.n.01')], [1, Synset('ceiling.n.01')], [1, Synset('ski_boot.n.01')], [1, Synset('nest.n.01')], [2, Synset('friend.n.01')], [2, Synset('user.n.01')], [2, Synset('neutral.n.01')], [2, Synset('unfortunate.n.01')], [2, Synset('female.n.01')], [2, Synset('chap.n.01')], [2, Synset('child.n.01')], [3, Synset('single_bed.n.01')], [3, Synset('high_table.n.01')], [3, Synset('salad_bar.n.01')], [3, Synset('upper_berth.n.01')], [4, Synset('underpass.n.01')], [4, Synset('main_street.n.01')], [4, Synset('chimney.n.01')], [4, Synset('front_door.n.01')], [4, Synset('main_drag.n.01')], [5, Synset('public_transport.n.01')], [5, Synset('stick.n.01')], [5, Synset('ski.n.01')], [5, Synset('hot_tub.n.01')], [5, Synset('electric_heater.n.01')], [5, Synset('switch.n.01')], [5, Synset('newspaper.n.01')], [5, Synset('internet.n.01')], [5, Synset('chime.n.01')], [5, Synset('stile.n.01')], [5, Synset('footage.n.01')], [5, Synset('ski_jump.n.01')], [6, Synset('snow_tire.n.01')], [6, Synset('booklet.n.01')], [6, Synset('undies.n.01')], [6, Synset('oven.n.01')]]
Reduce dimensionality with PCA to be able to draw clusters in a 2D graph.
In [20]:
#3 clusters graph

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt


reduced_data = PCA(n_components=2).fit_transform(X)

fig, ax = plt.subplots(figsize=(16,10))
for index, instance in enumerate(reduced_data):
    pca_comp_1, pca_comp_2 = reduced_data[index]
    color = labels_color_map[labels[0][index]]
    ax.scatter(pca_comp_1, pca_comp_2, c=color)
    ax.annotate(s=str(vcandidates[index]),xy=(pca_comp_1, pca_comp_2))
plt.show()
In [21]:
#7 clusters graph

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt


reduced_data = PCA(n_components=2).fit_transform(X)

fig, ax = plt.subplots(figsize=(16,10))
for index, instance in enumerate(reduced_data):
    pca_comp_1, pca_comp_2 = reduced_data[index]
    color = labels_color_map[labels[4][index]]
    ax.scatter(pca_comp_1, pca_comp_2, c=color)
    ax.annotate(s=str(vcandidates[index]),xy=(pca_comp_1, pca_comp_2))
plt.show()
In [22]:
#Let's create distance matrix using cosine similarity
#de los candidatos a target.
from numpy import matrix

def create_vector(word, vocabulary):
    vector = [model.similarity(word,s) for s in vocabulary] #Calculamos la cosine similarity
    return vector

#Create a vector with the synsets
#of the 50 best aspect candidates obtained before
scandidates=[]
for c in similarity[0:50]:
        scandidates.append((c[0]))
    
#Creamos el vector de distancias
vectors2 = [create_vector(v,scandidates) for v in scandidates]

Xs = matrix(vectors2)

print(Xs)
[[1.0000001  0.8347789  0.8012507  ... 0.6475095  0.64573866 0.6454631 ]
 [0.8347789  1.         0.65909815 ... 0.4374785  0.53323054 0.6872627 ]
 [0.8012507  0.65909815 1.         ... 0.47521716 0.6637959  0.5846829 ]
 ...
 [0.6475095  0.4374785  0.47521716 ... 1.0000001  0.55050087 0.3266015 ]
 [0.64573866 0.53323054 0.6637959  ... 0.55050087 1.         0.32953513]
 [0.6454631  0.6872627  0.5846829  ... 0.3266015  0.32953513 1.        ]]
In [23]:
#Let's use kmeans to perform clustering.

clusters2,labels2=mykmeans(Xs,scandidates)

for cl in clusters2:   
    print()
    print('Number of clusters: '+str(cl[0]))
    print()
    print(cl[1])
Número de clusters: 3

[[0, 'hotel'], [0, 'hotels'], [0, 'restaurant'], [0, 'inn'], [0, 'apartment'], [0, 'shopping'], [0, 'rented'], [0, 'mall'], [0, 'cafe'], [0, 'upscale'], [0, 'dining'], [0, 'rooms'], [0, 'bedroom'], [0, 'apartments'], [0, 'shop'], [0, 'restaurants'], [0, 'parking'], [1, 'mansion'], [1, 'downtown'], [1, 'plaza'], [1, 'outside'], [1, 'room'], [1, 'palace'], [1, 'residence'], [1, 'opened'], [1, 'beach'], [1, 'airport'], [1, 'nearby'], [1, 'home'], [1, 'houses'], [1, 'overlooking'], [1, 'city'], [1, 'at'], [1, 'building'], [1, 'neighborhood'], [1, 'terrace'], [1, 'lodge'], [2, 'hilton'], [2, 'resort'], [2, 'luxury'], [2, 'sheraton'], [2, 'marriott'], [2, 'vegas'], [2, 'luxurious'], [2, 'resorts'], [2, 'boutique'], [2, 'suites'], [2, 'lodging'], [2, 'lounge'], [2, 'tourist']]

Número de clusters: 4

[[0, 'hotel'], [0, 'shopping'], [0, 'mall'], [0, 'downtown'], [0, 'plaza'], [0, 'outside'], [0, 'opened'], [0, 'airport'], [0, 'nearby'], [0, 'home'], [0, 'city'], [0, 'at'], [0, 'neighborhood'], [1, 'hotels'], [1, 'restaurant'], [1, 'inn'], [1, 'luxury'], [1, 'cafe'], [1, 'upscale'], [1, 'dining'], [1, 'shop'], [1, 'restaurants'], [1, 'luxurious'], [1, 'boutique'], [1, 'suites'], [1, 'lodging'], [1, 'lounge'], [2, 'hilton'], [2, 'resort'], [2, 'sheraton'], [2, 'marriott'], [2, 'beach'], [2, 'vegas'], [2, 'resorts'], [2, 'lodge'], [2, 'tourist'], [3, 'apartment'], [3, 'mansion'], [3, 'rented'], [3, 'room'], [3, 'palace'], [3, 'residence'], [3, 'rooms'], [3, 'bedroom'], [3, 'apartments'], [3, 'houses'], [3, 'overlooking'], [3, 'building'], [3, 'terrace'], [3, 'parking']]

Número de clusters: 5

[[0, 'apartment'], [0, 'rented'], [0, 'room'], [0, 'rooms'], [0, 'bedroom'], [0, 'apartments'], [0, 'houses'], [0, 'parking'], [1, 'mansion'], [1, 'plaza'], [1, 'palace'], [1, 'residence'], [1, 'beach'], [1, 'overlooking'], [1, 'terrace'], [1, 'lodge'], [2, 'hilton'], [2, 'resort'], [2, 'luxury'], [2, 'sheraton'], [2, 'marriott'], [2, 'vegas'], [2, 'luxurious'], [2, 'resorts'], [2, 'suites'], [2, 'lodging'], [2, 'tourist'], [3, 'downtown'], [3, 'outside'], [3, 'opened'], [3, 'airport'], [3, 'nearby'], [3, 'home'], [3, 'city'], [3, 'at'], [3, 'building'], [3, 'neighborhood'], [4, 'hotel'], [4, 'hotels'], [4, 'restaurant'], [4, 'inn'], [4, 'shopping'], [4, 'mall'], [4, 'cafe'], [4, 'upscale'], [4, 'dining'], [4, 'shop'], [4, 'restaurants'], [4, 'boutique'], [4, 'lounge']]

Número de clusters: 6

[[0, 'hotel'], [0, 'hotels'], [0, 'restaurant'], [0, 'shopping'], [0, 'mall'], [0, 'luxury'], [0, 'cafe'], [0, 'upscale'], [0, 'shop'], [0, 'restaurants'], [0, 'tourist'], [1, 'mansion'], [1, 'plaza'], [1, 'palace'], [1, 'residence'], [1, 'beach'], [1, 'overlooking'], [1, 'terrace'], [1, 'lodge'], [2, 'hilton'], [2, 'resort'], [2, 'sheraton'], [2, 'marriott'], [2, 'vegas'], [2, 'resorts'], [3, 'downtown'], [3, 'outside'], [3, 'opened'], [3, 'airport'], [3, 'nearby'], [3, 'home'], [3, 'city'], [3, 'at'], [3, 'building'], [3, 'neighborhood'], [4, 'inn'], [4, 'dining'], [4, 'luxurious'], [4, 'boutique'], [4, 'suites'], [4, 'lodging'], [4, 'lounge'], [5, 'apartment'], [5, 'rented'], [5, 'room'], [5, 'rooms'], [5, 'bedroom'], [5, 'apartments'], [5, 'houses'], [5, 'parking']]

Número de clusters: 7

[[0, 'mansion'], [0, 'plaza'], [0, 'palace'], [0, 'residence'], [0, 'beach'], [0, 'overlooking'], [0, 'terrace'], [0, 'lodge'], [1, 'hotel'], [1, 'restaurant'], [1, 'shopping'], [1, 'mall'], [1, 'shop'], [2, 'downtown'], [2, 'outside'], [2, 'opened'], [2, 'airport'], [2, 'nearby'], [2, 'home'], [2, 'city'], [2, 'at'], [2, 'building'], [2, 'neighborhood'], [3, 'hotels'], [3, 'luxury'], [3, 'upscale'], [3, 'restaurants'], [3, 'luxurious'], [3, 'boutique'], [3, 'lodging'], [3, 'tourist'], [4, 'apartment'], [4, 'rented'], [4, 'room'], [4, 'rooms'], [4, 'bedroom'], [4, 'apartments'], [4, 'houses'], [4, 'parking'], [5, 'inn'], [5, 'cafe'], [5, 'dining'], [5, 'suites'], [5, 'lounge'], [6, 'hilton'], [6, 'resort'], [6, 'sheraton'], [6, 'marriott'], [6, 'vegas'], [6, 'resorts']]
In [24]:
#3 clusters graph

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt


reduced_data = PCA(n_components=2).fit_transform(Xs)

fig, ax = plt.subplots(figsize=(16,10))
for index, instance in enumerate(reduced_data):
    pca_comp_1, pca_comp_2 = reduced_data[index]
    color = labels_color_map[labels2[0][index]]
    ax.scatter(pca_comp_1, pca_comp_2, c=color)
    ax.annotate(s=str(scandidates[index]),xy=(pca_comp_1, pca_comp_2))
plt.show()
In [25]:
#7 clusters graph

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt


reduced_data = PCA(n_components=2).fit_transform(Xs)

fig, ax = plt.subplots(figsize=(16,10))
for index, instance in enumerate(reduced_data):
    pca_comp_1, pca_comp_2 = reduced_data[index]
    color = labels_color_map[labels2[4][index]]
    ax.scatter(pca_comp_1, pca_comp_2, c=color)
    ax.annotate(s=str(scandidates[index]),xy=(pca_comp_1, pca_comp_2))
plt.show()