Showing posts with label HiveQL. Show all posts
Showing posts with label HiveQL. Show all posts

Thursday, June 29, 2017

Sample Data Set for Hive Practice

1. Take sample data source for use case from below link:

                    http://www.grouplens.org/system/files/ml-1m.zip

2. It contains data around movies, users, ratings.  unzip it.

3. Below are the 3 files in archive:

  movies.dat, ratings.dat, users.dat

4. Files in above are delimited by '::' just to have better readability (and one example to handle delimiter) change the delimiter to something other, you can keep the same, I am changing it to '#'

sed 's/::/#/g' movies.dat
sed 's/::/#/g' users.dat
sed 's/::/#/g' ratings.dat

Contents of the file would be:

movies:

structure: 
id#name#genre

sample data :
1#Toy Story (1995)#Animation|Children's|Comedy
2#Jumanji (1995)#Adventure|Children's|Fantasy
3#Grumpier Old Men (1995)#Comedy|Romance
4#Waiting to Exhale (1995)#Comedy|Drama

users:

structure:
id#gender#age#occupationid#zipcode

sample data:
1#F#1#10#48067
2#M#56#16#70072
3#M#25#15#55117
4#M#45#7#02460
5#M#25#20#55455


ratings:

structure:
userid#movieid#rating#tmstmp

Sample Data:
1#1193#5#978300760
1#661#3#978302109
1#914#3#978301968
1#3408#4#978300275
1#2355#5#978824291

just to have meaningful data, create an occupation data set

create a file named occupation.dat with below data:

vim occupation.dat

copy paste below and save the file.

0#other/not specified
1#academic/educator
2#artist
3#clerical/admin
4#college/grad student
5#customer service
6#doctor/health care
7#executive/managerial
8#farmer
9#homemaker
10#K-12 student
11#lawyer
12#programmer
13#retired
14#sales/marketing
15#scientist
16#self-employed
17#technician/engineer
18#tradesman/craftsman
19#unemployed
20#writer


Move the above files into the HDFS:

I have created 4 directories in /hive/data named user, movie, rating, occupation

hadoop fs -put occupations.dat /hive/data/occupation
hadoop fs -put users.dat /hive/data/user
hadoop fs -put movies.dat /hive/data/movie
hadoop fs -put ratngs.dat /hive/data/rating

Wednesday, June 28, 2017

Practice Hive Queries ( HiveQL Practice )

set up sample data set
if the data set up is done now let's do the hive stuff:

1. create a separate database named movielens
          create database movielens;
          use movielens;
       
2. create tables to hold data
     
          CREATE EXTERNAL TABLE ratings (
                   userid INT,
                   movieid INT,
                   rating INT,
                   tstamp STRING
                    ) ROW FORMAT DELIMITED
                    FIELDS TERMINATED BY '#'
                    STORED AS TEXTFILE
                    LOCATION '/hive/data/rating';
                 
       
          CREATE EXTERNAL TABLE movies (
                    movieid INT,
                    title STRING,
                    genres ARRAY<STRING>
                  ) ROW FORMAT DELIMITED
                  FIELDS TERMINATED BY '#'
                  COLLECTION ITEMS TERMINATED BY "|"
                  STORED AS TEXTFILE
                  LOCATION '/hive/data/movie';
               
          CREATE EXTERNAL TABLE users (
                    userid INT,
                    gender STRING,
                    age INT,
                    occupation_id INT,
                    zipcode STRING
                  ) ROW FORMAT DELIMITED
                  FIELDS TERMINATED BY '#'
                  STORED AS TEXTFILE
                  LOCATION '/hive/data/user';
         
           CREATE EXTERNAL TABLE occupations (
                    id INT,
                    occupation STRING
                  ) ROW FORMAT DELIMITED
                  FIELDS TERMINATED BY '#'
                  STORED AS TEXTFILE
                  LOCATION '/hive/data/occupation';
               
3. see if data is loaded

hive> select * from users limit 2;
OK
1 F 1 10 48067
2 M 56 16 70072
Time taken: 0.278 seconds, Fetched: 2 row(s)

hive> select * from movies limit 2;
OK
1 Toy Story (1995) ["Animation","Children's","Comedy"]
2 Jumanji (1995) ["Adventure","Children's","Fantasy"]
Time taken: 0.352 seconds, Fetched: 2 row(s)

hive> select * from ratings limit 2;
OK
1 1193 5 978300760
1 661 3 978302109
Time taken: 0.28 seconds, Fetched: 2 row(s)

hive> select * from occupations limit 2;
OK
0 other/not specified
1 academic/educator
Time taken: 0.245 seconds, Fetched: 2 row(s)

if you are all good till here than lets practice hiveQL stuffs.

NOTE: in each case to maintain readabilty I will limit the output to 10 only.

Use Case 1:
Find out Occupation of all the users:

Solution:
select u.*, o.occupation from users u, occupations o where u.occupation_id= o.id limit 10;


OUTPUT:

1 F 1 10 48067 K-12 student
2 M 56 16 70072 self-employed
3 M 25 15 55117 scientist
4 M 45 7 02460 executive/managerial
5 M 25 20 55455 writer
6 F 50 9 55117 homemaker
7 M 35 1 06810 academic/educator
8 M 25 12 11413 programmer
9 M 25 17 61614 technician/engineer
10 F 35 1 95370 academic/educator


Use Case 2:
Find out numbers of non-adults as per Indian standard, who has rated movies:

Solution: select count(*) from users where age < 18;
222

Use case 3:
Find out the no of users with same occupation and having age more than 25 along with occupation details:

Solution:
select o.occupation, count(1)  from users u join occupations o where u.occupation_id= o.id AND u.age > 24 group by o.occupation;

K-12 student 3
academic/educator 479
artist 220
clerical/admin 155
college/grad student 222
customer service 94
doctor/health care 227
executive/managerial 660
farmer 15
homemaker 86
lawyer 121
other/not specified 578
programmer 328
retired 141
sales/marketing 263
scientist 130
self-employed 223
technician/engineer 448
tradesman/craftsman 60
unemployed 30
writer 232


Use Case 4: Find the age of the most rated user with counts of rating;

Solution:

select u.userid, u.age, x.count from users u join ( select r.userid, count(rating) count from ratings r group by (r.userid) order by count DESC limit 1) x where u.userid = x.userid;

4169 50 2314



---------------------
I am working on it, will add more examples here.