Thursday, June 29, 2017

Sample Data Set for Hive Practice

1. Download the sample data set for this use case from the link below:

                    http://www.grouplens.org/system/files/ml-1m.zip

2. It contains data about movies, users, and ratings. Unzip it.

3. The archive contains these 3 data files:

  movies.dat, ratings.dat, users.dat

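4. The fields in these files are delimited by '::'. For better readability (and as an example of handling a custom delimiter), change the delimiter to something else, or keep it as is. I am changing it to '#'. The -i option (GNU sed) edits each file in place; without it, sed only prints the converted text to the terminal and leaves the file unchanged.

sed -i 's/::/#/g' movies.dat
sed -i 's/::/#/g' users.dat
sed -i 's/::/#/g' ratings.dat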
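
To see what the substitution does before touching the real files, it can be tried on a throwaway copy. The sketch below is only an illustration: the scratch directory and the file names sample_movies.dat and sample_movies.hash are made up for the example, and the two records are the first movie rows in their original '::' form. It writes the result to a new file rather than editing in place, so the input stays untouched.

```shell
# Scratch directory so the real data files are left alone (hypothetical location).
workdir=$(mktemp -d)

# Two records in the original '::'-delimited form.
printf '%s\n' \
  "1::Toy Story (1995)::Animation|Children's|Comedy" \
  "2::Jumanji (1995)::Adventure|Children's|Fantasy" \
  > "$workdir/sample_movies.dat"

# Replace every '::' with '#', writing the result to a new file.
sed 's/::/#/g' "$workdir/sample_movies.dat" > "$workdir/sample_movies.hash"

# Count the lines that still contain '::' (0 means the conversion is complete).
# grep exits non-zero when nothing matches, hence the '|| true'.
remaining=$(grep -c '::' "$workdir/sample_movies.hash" || true)
echo "lines still containing '::': $remaining"

# Count the rows that do not have exactly 3 '#'-separated fields.
bad_rows=$(awk -F'#' 'NF != 3 {n++} END {print n+0}' "$workdir/sample_movies.hash")
echo "rows without 3 fields: $bad_rows"
```

The same two checks can be pointed at the real movies.dat after conversion; for users.dat and ratings.dat the expected field counts would be 5 and 4.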

After the change, the contents of the files look like this:

movies:

structure: 
id#name#genre

sample data:
1#Toy Story (1995)#Animation|Children's|Comedy
2#Jumanji (1995)#Adventure|Children's|Fantasy
3#Grumpier Old Men (1995)#Comedy|Romance
4#Waiting to Exhale (1995)#Comedy|Drama

users:

structure:
id#gender#age#occupationid#zipcode

sample data:
1#F#1#10#48067
2#M#56#16#70072
3#M#25#15#55117
4#M#45#7#02460
5#M#25#20#55455


ratings:

structure:
userid#movieid#rating#tmstmp

sample data:
1#1193#5#978300760
1#661#3#978302109
1#914#3#978301968
1#3408#4#978300275
1#2355#5#978824291

To make the data more meaningful, create an occupation data set that maps the occupationid values in users.dat to occupation names.

Create a file named occupation.dat with the data below:

vim occupation.dat

Copy and paste the lines below and save the file.

0#other/not specified
1#academic/educator
2#artist
3#clerical/admin
4#college/grad student
5#customer service
6#doctor/health care
7#executive/managerial
8#farmer
9#homemaker
10#K-12 student
11#lawyer
12#programmer
13#retired
14#sales/marketing
15#scientist
16#self-employed
17#technician/engineer
18#tradesman/craftsman
19#unemployed
20#writer
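
As an alternative to pasting into an editor, the same file can be written with a shell here-document. This is a minimal sketch, assuming it is run in the directory where occupation.dat should live; the two checks at the end are an addition for illustration.

```shell
# Write the occupation lookup file without opening an editor.
# The quoted 'EOF' keeps the shell from expanding anything inside the block.
cat > occupation.dat <<'EOF'
0#other/not specified
1#academic/educator
2#artist
3#clerical/admin
4#college/grad student
5#customer service
6#doctor/health care
7#executive/managerial
8#farmer
9#homemaker
10#K-12 student
11#lawyer
12#programmer
13#retired
14#sales/marketing
15#scientist
16#self-employed
17#technician/engineer
18#tradesman/craftsman
19#unemployed
20#writer
EOF

# Sanity checks: the file should have 21 rows (ids 0 to 20),
# and every row should have exactly 2 '#'-separated fields.
wc -l < occupation.dat
awk -F'#' 'NF != 2 {bad++} END {print bad+0, "malformed rows"}' occupation.dat
```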


Move the above files into HDFS:

I have created 4 directories under /hive/data named user, movie, rating, and occupation.
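
One way those directories might be created is sketched below. The run helper is a made-up wrapper for illustration, not part of Hadoop: it executes the command when a hadoop client is on the PATH and only prints it otherwise, so the sketch can be tried on a machine without a cluster. The directory names are the ones listed above.

```shell
# Dry-run helper (hypothetical): run the command when the hadoop client
# is available, otherwise just print what would have been run.
run() {
  if command -v hadoop >/dev/null 2>&1; then
    "$@"
  else
    echo "would run: $*"
  fi
}

# One HDFS directory per data set.
# -p also creates the parent directories /hive and /hive/data if they are missing.
for name in user movie rating occupation; do
  run hadoop fs -mkdir -p "/hive/data/$name"
done
```

On a machine with a working cluster, hadoop fs -ls /hive/data should then list the four directories.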

hadoop fs -put occupation.dat /hive/data/occupation
hadoop fs -put users.dat /hive/data/user
hadoop fs -put movies.dat /hive/data/movie
hadoop fs -put ratings.dat /hive/data/rating
