I want to run the code from the page at the link below in Spyder, but I don't know how to integrate the UCI Zoo Data Set used there into Spyder. Could you please help me? Do I need to download and install a library, or do I need to do something else?

Hello,

If you go to the Download section on the linked page, you are redirected to http://archive.ics.uci.edu/ml/machine-learning-databases/zoo/. From that address you need to download the dataset at http://archive.ics.uci.edu/ml/machine-learning-databases/zoo/zoo.data.
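If you prefer, the file can also be downloaded from within Python; a minimal sketch (saving it as zoo.data in the working directory is just an assumption, any path works):

from urllib.request import urlretrieve

# download zoo.data next to the script (file name chosen here for illustration)
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/zoo/zoo.data"
urlretrieve(url, "zoo.data")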

If you look inside the downloaded file, you will see that no column names are written in it. If the column names were included, we could read the dataset easily with pd.read_csv(). As it is, we need to follow an approach like the code I share below, and the column names have to be created by hand.

At http://archive.ics.uci.edu/ml/datasets/zoo there is a section called Attribute Information; the column names are listed there.
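For completeness: pd.read_csv() can also read a file that has no header row if you pass the column names through its names parameter. A minimal sketch using the column names from the Attribute Information section:

import pandas as pd

columns = [
    "animal name", "hair", "feathers", "eggs", "milk", "airborne",
    "aquatic", "predator", "toothed", "backbone", "breathes", "venomous",
    "fins", "legs", "tail", "domestic", "catsize", "type"
]

df = pd.read_csv("zoo.data", names=columns)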

Note: I used the csv library to read the file. Alternatively, it could also be done like this: after opening the file, it can be read line by line with readlines(), and the str data on each line can be split at the commas with split(). (A sketch of that alternative is shown after the output below.)

import csv
import pandas as pd

with open("zoo.data") as f:
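    # keep the first field (the animal name) as str and convert the remaining fields to int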
    data = [[i[0]] + list(map(int, i[1:])) for i in csv.reader(f)]

columns = [
    "animal name", "hair", "feathers", "eggs", "milk", "airborne",
    "aquatic", "predator", "toothed", "backbone", "breathes", "venomous",
    "fins", "legs", "tail", "domestic", "catsize", "type"
]

df = pd.DataFrame(data, columns=columns)
print(df)
    animal name  hair  feathers  eggs  ...  tail  domestic  catsize  type
0      aardvark     1         0     0  ...     0         0        1     1
1      antelope     1         0     0  ...     1         0        1     1
2          bass     0         0     1  ...     1         0        0     4
3          bear     1         0     0  ...     0         0        1     1
4          boar     1         0     0  ...     1         0        1     1
..          ...   ...       ...   ...  ...   ...       ...      ...   ...
96      wallaby     1         0     0  ...     1         0        1     1
97         wasp     1         0     1  ...     0         0        0     6
98         wolf     1         0     0  ...     1         0        1     1
99         worm     0         0     1  ...     0         0        0     7
100        wren     0         1     1  ...     1         0        0     2

[101 rows x 18 columns]
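The readlines()/split() alternative mentioned in the note above could look roughly like this (a sketch; it reuses the pandas import and the columns list from the code above and does the str/int conversion by hand):

with open("zoo.data") as f:
    lines = f.readlines()

data = []
for line in lines:
    parts = line.strip().split(",")  # split the str data of each line at the commas
    data.append([parts[0]] + [int(x) for x in parts[1:]])

df = pd.DataFrame(data, columns=columns)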

During training you will probably not use the first column of the dataset. From here on, the rest depends on your own topic.
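For example, one way to leave that column out would be:

df = df.drop(columns=["animal name"])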

1 Like

Thank you very much. We carried out the steps you described, but I am getting a different kind of index error.
I converted the data to csv format; at first I could open the dataset on the side as a proper numeric table, but after running the code there was NaN written in place of the numeric values in the dataset.

@bgm If possible, could you share the code you wrote?

“”"
Make the imports of python packages needed
“”"
import pandas as pd
import numpy as np
from pprint import pprint

#Import the dataset and define the feature as well as the target datasets / columns#
mydataset = pd.read_csv('zoo.csv',
                        names=['animal_name','hair','feathers','eggs','milk',
                               'airbone','aquatic','predator','toothed','backbone',
                               'breathes','venomous','fins','legs','tail','domestic','catsize','class'])
#Import all columns, omitting the first, which contains the names of the animals

#We drop the animal names since this is not a good feature to split the data on
mydataset = mydataset.drop('animal_name',axis=1)

###################

def entropy(target_col):
    """
    Calculate the entropy of a dataset.
    The only parameter of this function is the target_col parameter which specifies the target column
    """
    elements,counts = np.unique(target_col,return_counts = True)
    entropy = np.sum([(-counts[i]/np.sum(counts))*np.log2(counts[i]/np.sum(counts)) for i in range(len(elements))])
    return entropy

###################

###################

def InfoGain(data,split_attribute_name,target_name="class"):
    """
    Calculate the information gain of a dataset. This function takes three parameters:
    1. data = The dataset for whose feature the IG should be calculated
    2. split_attribute_name = the name of the feature for which the information gain should be calculated
    3. target_name = the name of the target feature. The default for this example is "class"
    """
    #Calculate the entropy of the total dataset
    total_entropy = entropy(data[target_name])

    ##Calculate the entropy of the dataset

    #Calculate the values and the corresponding counts for the split attribute
    vals,counts = np.unique(data[split_attribute_name],return_counts=True)

    #Calculate the weighted entropy
    Weighted_Entropy = np.sum([(counts[i]/np.sum(counts))*entropy(data.where(data[split_attribute_name]==vals[i]).dropna()[target_name]) for i in range(len(vals))])

    #Calculate the information gain
    Information_Gain = total_entropy - Weighted_Entropy
    return Information_Gain

###################

###################

def ID3(data,originaldata,features,target_attribute_name="class",parent_node_class = None):
    """
    ID3 Algorithm: This function takes five parameters:
    1. data = the data for which the ID3 algorithm should be run --> In the first run this equals the total dataset

    2. originaldata = This is the original dataset needed to calculate the mode target feature value of the original dataset
    in the case the dataset delivered by the first parameter is empty

    3. features = the feature space of the dataset. This is needed for the recursive call since during the tree growing process
    we have to remove features from our dataset --> Splitting at each node

    4. target_attribute_name = the name of the target attribute

    5. parent_node_class = This is the value or class of the mode target feature value of the parent node for a specific node. This is
    also needed for the recursive call since if the splitting leads to a situation that there are no more features left in the feature
    space, we want to return the mode target feature value of the direct parent node.
    """
    #Define the stopping criteria --> If one of these is satisfied, we want to return a leaf node#

    #If all target_values have the same value, return this value
    if len(np.unique(data[target_attribute_name])) <= 1:
        return np.unique(data[target_attribute_name])[0]

    #If the dataset is empty, return the mode target feature value in the original dataset
    elif len(data)==0:
        return np.unique(originaldata[target_attribute_name])[np.argmax(np.unique(originaldata[target_attribute_name],return_counts=True)[1])]

    #If the feature space is empty, return the mode target feature value of the direct parent node --> Note that
    #the direct parent node is that node which has called the current run of the ID3 algorithm and hence
    #the mode target feature value is stored in the parent_node_class variable.

    elif len(features) ==0:
        return parent_node_class

    #If none of the above holds true, grow the tree!

    else:
        #Set the default value for this node --> The mode target feature value of the current node
        parent_node_class = np.unique(data[target_attribute_name])[np.argmax(np.unique(data[target_attribute_name],return_counts=True)[1])]

        #Select the feature which best splits the dataset
        item_values = [InfoGain(data,feature,target_attribute_name) for feature in features] #Return the information gain values for the features in the dataset
        best_feature_index = np.argmax(item_values)
        best_feature = features[best_feature_index]

        #Create the tree structure. The root gets the name of the feature (best_feature) with the maximum information
        #gain in the first run
        tree = {best_feature:{}}

        #Remove the feature with the best information gain from the feature space
        features = [i for i in features if i != best_feature]

        #Grow a branch under the root node for each possible value of the root node feature

        for value in np.unique(data[best_feature]):
            #Split the dataset along the value of the feature with the largest information gain and therewith create sub_datasets
            sub_data = data.where(data[best_feature] == value).dropna()

            #Call the ID3 algorithm for each of those sub_datasets with the new parameters --> Here the recursion comes in!
            subtree = ID3(sub_data,mydataset,features,target_attribute_name,parent_node_class)

            #Add the sub tree, grown from the sub_dataset, to the tree under the root node
            tree[best_feature][value] = subtree

        return tree

###################

###################

def predict(query,tree,default = 1):
    """
Prediction of a new/unseen query instance. This takes two parameters:
1. The query instance as a dictionary of the shape {“feature_name”:feature_value,…}

2. The tree 


We do this also in a recursive manner. That is, we wander down the tree and check if we have reached a leaf or if we are still in a sub tree. 
Since this is an important step to understand, the single steps are extensively commented below.

1.Check for every feature in the query instance if this feature is existing in the tree.keys() for the first call, 
tree.keys() only contains the value for the root node 
--> if this value is not existing, we can not make a prediction and have to 
return the default value which is the majority value of the target feature

2. First of all we have to take care of an important fact: Since we train our model with a dataset A and then show our model
an unseen query, it may happen that the feature values of this query do not exist in our tree model because none of the
training instances had such a value for this specific feature.
For instance imagine the situation where your model has only seen animals with one to four
legs - The "legs" node in your model will only have four outgoing branches (from one to four). If you now show your model
a new instance (animal) which has the value 5 for the legs feature, you have to tell your model what to do in such a
situation because otherwise no classification is possible: in the classification step you would try to
run down the outgoing branch with the value 5, but there is no such branch. Hence: Error and no Classification!
We can address this issue with a classification value of, for instance, (999) which tells us that no classification is
possible, or we assign the most frequent target feature value of the dataset used to train the model. Or, for instance in a
medical application, we could return the worst case - just to make sure...
We can also return the most frequent value of the direct parent node. To make a long story short, we have to tell the model 
what to do in this situation.
In our example, since we are dealing with animal species where a false classification is not that critical, we will assign
the value 1 which is the value for the mammal species (for convenience).

3. Address the key in the tree which fits the value for key --> Note that key == the features in the query. 
Because we want the tree to predict the value which is hidden under the key value (imagine you have a drawn tree model on 
the table in front of you and you have a query instance for which you want to predict the target feature 
- What are you doing? - Correct:
You start at the root node and wander down the tree comparing your query to the node values. Hence you want to have the
value which is hidden under the current node. If this is a leaf, perfect, otherwise you wander the tree deeper until you
get to a leaf node. 
Though, you want to have this "something" [either leaf or sub_tree] which is hidden under the current node
and hence we must address the node in the tree which == the key value from our query instance. 
This is done with tree[keys]. Next you want to run down the branch of this node which is equal to the value given "behind"
the key value of your query instance e.g. if you find "legs" == to tree.keys() that is, for the first run == the root node.
You want to run deeper and therefore you have to address the branch at your node whose value is == to the value behind key.
This is done with query[key] e.g. query[key] == query['legs'] == 0 --> Therewith we run down the branch of the node with the
value 0. Summarized, in this step we want to address the node which is hidden behind a specific branch of the root node (in the first run)
this is done with: result = tree[key][query[key]]

4. As said in the 2. step, we run down the tree along nodes and branches until we get to a leaf node.
That is, if result = tree[key][query[key]] returns another tree object (we have represented this by a dict object -->
that is, if result is a dict object) we know that we have not arrived at a leaf node and have to go deeper into the tree.
Okay... Look at your drawn tree in front of you... what are you doing?...well, you run down the next branch... 
exactly as we have done it above with the slight difference that we already have passed a node and therewith 
have to run only a fraction of the tree --> You clever guy! That "fraction of the tree" is exactly what we have stored
under 'result'.
So we simply call our predict method using the same query instance (we do not have to drop any features from the query
instance since for instance the feature for the root node will not be available in any of the deeper sub_trees and hence 
we will simply not find that feature) as well as the "reduced / sub_tree" stored in result.

SUMMARIZED: If we have a query instance consisting of values for features, we take these features and check if the
name of the root node is equal to one of the query features.
If this is true, we run down the root node outgoing branch whose value equals the value of query feature == the root node.
If we find at the end of this branch a leaf node (not a dict object) we return this value (this is our prediction).
If we instead find another node (== sub_tree == dict object) we search in our query for the feature which equals the value
of that node. Next we look up the value of our query feature and run down the branch whose value is equal to the 
query[key] == query feature value. And as you can see this is exactly the recursion we talked about
with the important fact that for each node we run down the tree, we check only the nodes and branches which are 
below this node and do not run the whole tree beginning at the root node 
--> This is why we re-call the classification function with 'result'
"""


    #1.
    for key in list(query.keys()):
        if key in list(tree.keys()):
            #2.
            try:
                result = tree[key][query[key]]
            except:
                return default

            #3.
            result = tree[key][query[key]]
            #4.
            if isinstance(result,dict):
                return predict(query,result)

            else:
                return result

“”"
Check the accuracy of our prediction.
The train_test_split function takes the dataset as parameter which should be divided into
a training and a testing set. The test function takes two parameters, which are the testing data as well as the tree model.
“”"
###################

###################

def train_test_split(mydataset):
    training_data = mydataset.iloc[:80].reset_index(drop=True)#We drop the index respectively relabel the index
    #starting from 0, because we do not want to run into errors regarding the row labels / indexes
    testing_data = mydataset.iloc[80:].reset_index(drop=True)
    return training_data,testing_data

training_data = train_test_split(mydataset)[0]
testing_data = train_test_split(mydataset)[1]

def test(data,tree):
    #Create new query instances by simply removing the target feature column from the original dataset and
    #convert it to a dictionary
    queries = data.iloc[:,:-1].to_dict(orient = "records")

    #Create an empty DataFrame in whose columns the predictions of the tree are stored
    predicted = pd.DataFrame(columns=["predicted"])

    #Calculate the prediction accuracy
    for i in range(len(data)):
        predicted.loc[i,"predicted"] = predict(queries[i],tree,1.0)
    print('The prediction accuracy is: ',(np.sum(predicted["predicted"] == data["class"])/len(data))*100,'%')

“”"
Train the tree, Print the tree and predict the accuracy
“”"
tree = ID3(training_data,training_data,training_data.columns[:-1])
pprint(tree)
test(testing_data,tree)

Can you view and read the code in this form?

1 Like