1. Take sample data source for use case from below link:
http://www.grouplens.org/system/files/ml-1m.zip
2. It contains data about movies, users, and ratings. Unzip it.
3. Below are the 3 files in archive:
movies.dat, ratings.dat, users.dat
4. The fields in these files are delimited by '::'. For readability (and as an example of handling a custom delimiter), change it to something else; you can keep '::' if you like, but I am changing it to '#'. Note that sed needs -i to edit the files in place; without it, the result is only printed to stdout:
sed -i 's/::/#/g' movies.dat
sed -i 's/::/#/g' users.dat
sed -i 's/::/#/g' ratings.dat
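A quick way to sanity-check the substitution before touching the real files is to run it on a scratch copy. This sketch uses two movie rows from the sample data above and assumes GNU sed (on macOS/BSD sed, use `-i ''` instead of `-i`):

```shell
# Create a tiny sample with the original '::' delimiter
# (two rows taken from movies.dat).
printf '%s\n' \
  "1::Toy Story (1995)::Animation|Children's|Comedy" \
  "2::Jumanji (1995)::Adventure|Children's|Fantasy" > movies_sample.dat

# Replace every '::' with '#' in place; drop -i to preview on stdout first.
sed -i 's/::/#/g' movies_sample.dat

cat movies_sample.dat
```

If the output shows '#' between fields, the same command is safe to run on movies.dat, users.dat, and ratings.dat.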
Contents of the files would be:
movies:
structure:
id#name#genre
sample data:
1#Toy Story (1995)#Animation|Children's|Comedy
2#Jumanji (1995)#Adventure|Children's|Fantasy
3#Grumpier Old Men (1995)#Comedy|Romance
4#Waiting to Exhale (1995)#Comedy|Drama
users:
structure:
id#gender#age#occupationid#zipcode
sample data:
1#F#1#10#48067
2#M#56#16#70072
3#M#25#15#55117
4#M#45#7#02460
5#M#25#20#55455
ratings:
structure:
userid#movieid#rating#tmstmp
sample data:
1#1193#5#978300760
1#661#3#978302109
1#914#3#978301968
1#3408#4#978300275
1#2355#5#978824291
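To confirm the '#'-delimited structure parses cleanly, here is a small illustrative sketch over the five sample ratings rows above: it writes them to a scratch file and averages the third field (rating) with awk. This is just a verification aid, not part of the load procedure:

```shell
# Write the five sample ratings rows shown above to a scratch file.
printf '%s\n' \
  '1#1193#5#978300760' \
  '1#661#3#978302109' \
  '1#914#3#978301968' \
  '1#3408#4#978300275' \
  '1#2355#5#978824291' > ratings_sample.dat

# Average the third field (rating) across all rows; prints 4.0
# for this sample (5+3+3+4+5 = 20 over 5 rows).
awk -F'#' '{ sum += $3; n++ } END { printf "%.1f\n", sum / n }' ratings_sample.dat
```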
To make the data more meaningful, create an occupation data set. Create a file named occupation.dat:
vim occupation.dat
Copy-paste the data below and save the file.
0#other/not specified
1#academic/educator
2#artist
3#clerical/admin
4#college/grad student
5#customer service
6#doctor/health care
7#executive/managerial
8#farmer
9#homemaker
10#K-12 student
11#lawyer
12#programmer
13#retired
14#sales/marketing
15#scientist
16#self-employed
17#technician/engineer
18#tradesman/craftsman
19#unemployed
20#writer
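The occupationid in users.dat is a foreign key into this occupation data. As a quick sketch of how the two relate (again on scratch copies of the sample rows, with awk assumed available), the classic two-file awk idiom looks up each user's occupation name:

```shell
# Scratch copies of sample rows shown above.
printf '%s\n' \
  '1#F#1#10#48067' \
  '2#M#56#16#70072' \
  '3#M#25#15#55117' > users_sample.dat
printf '%s\n' \
  '10#K-12 student' \
  '15#scientist' \
  '16#self-employed' > occupation_sample.dat

# First pass (NR==FNR) loads occupation names keyed by id; the
# second pass replaces each user's occupationid (field 4) with it.
awk -F'#' 'NR==FNR { occ[$1] = $2; next }
           { print $1 "#" $2 "#" $3 "#" occ[$4] "#" $5 }' \
    occupation_sample.dat users_sample.dat
```

In Hive, this lookup becomes a plain JOIN between the users and occupation tables on occupationid.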
Move the above files into HDFS:
I have created 4 directories in /hive/data named user, movie, rating, occupation:
hadoop fs -mkdir -p /hive/data/user /hive/data/movie /hive/data/rating /hive/data/occupation
hadoop fs -put occupation.dat /hive/data/occupation
hadoop fs -put users.dat /hive/data/user
hadoop fs -put movies.dat /hive/data/movie
hadoop fs -put ratings.dat /hive/data/rating