
Wednesday, June 28, 2017

Practice Hive Queries ( HiveQL Practice )

First, set up the sample data set.
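
If you still need to stage the files, here is a minimal sketch using the dfs command in a Hive script (run with hive -f), assuming you have already converted the MovieLens files into '#'-delimited text with the hypothetical local names ratings.txt, movies.txt, users.txt and occupations.txt:

          -- create the HDFS directories the external tables below will point to
          dfs -mkdir -p /hive/data/rating /hive/data/movie /hive/data/user /hive/data/occupation;

          -- copy the local files into those directories
          dfs -put ratings.txt /hive/data/rating/;
          dfs -put movies.txt /hive/data/movie/;
          dfs -put users.txt /hive/data/user/;
          dfs -put occupations.txt /hive/data/occupation/;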
Once the data set-up is done, let's do the Hive stuff:

1. create a separate database named movielens
          create database movielens;
          use movielens;
       
2. create tables to hold data
     
          CREATE EXTERNAL TABLE ratings (
                   userid INT,
                   movieid INT,
                   rating INT,
                   tstamp STRING
                    ) ROW FORMAT DELIMITED
                    FIELDS TERMINATED BY '#'
                    STORED AS TEXTFILE
                    LOCATION '/hive/data/rating';
                 
       
          CREATE EXTERNAL TABLE movies (
                    movieid INT,
                    title STRING,
                    genres ARRAY<STRING>
                  ) ROW FORMAT DELIMITED
                  FIELDS TERMINATED BY '#'
                  COLLECTION ITEMS TERMINATED BY "|"
                  STORED AS TEXTFILE
                  LOCATION '/hive/data/movie';
               
          CREATE EXTERNAL TABLE users (
                    userid INT,
                    gender STRING,
                    age INT,
                    occupation_id INT,
                    zipcode STRING
                  ) ROW FORMAT DELIMITED
                  FIELDS TERMINATED BY '#'
                  STORED AS TEXTFILE
                  LOCATION '/hive/data/user';
         
           CREATE EXTERNAL TABLE occupations (
                    id INT,
                    occupation STRING
                  ) ROW FORMAT DELIMITED
                  FIELDS TERMINATED BY '#'
                  STORED AS TEXTFILE
                  LOCATION '/hive/data/occupation';
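
(Optional) Once the four tables are created, a quick sanity check with Hive's built-in commands confirms the DDL went through as expected:

           show tables;
           describe formatted ratings;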
               
3. see if data is loaded

hive> select * from users limit 2;
OK
1 F 1 10 48067
2 M 56 16 70072
Time taken: 0.278 seconds, Fetched: 2 row(s)

hive> select * from movies limit 2;
OK
1 Toy Story (1995) ["Animation","Children's","Comedy"]
2 Jumanji (1995) ["Adventure","Children's","Fantasy"]
Time taken: 0.352 seconds, Fetched: 2 row(s)

hive> select * from ratings limit 2;
OK
1 1193 5 978300760
1 661 3 978302109
Time taken: 0.28 seconds, Fetched: 2 row(s)

hive> select * from occupations limit 2;
OK
0 other/not specified
1 academic/educator
Time taken: 0.245 seconds, Fetched: 2 row(s)

If you are all good till here, then let's practice the HiveQL stuff.

NOTE: in each case, to maintain readability, I will limit the output to 10 rows.

Use Case 1:
Find out the occupation of all the users:

Solution:
select u.*, o.occupation from users u, occupations o where u.occupation_id= o.id limit 10;


OUTPUT:

1 F 1 10 48067 K-12 student
2 M 56 16 70072 self-employed
3 M 25 15 55117 scientist
4 M 45 7 02460 executive/managerial
5 M 25 20 55455 writer
6 F 50 9 55117 homemaker
7 M 35 1 06810 academic/educator
8 M 25 12 11413 programmer
9 M 25 17 61614 technician/engineer
10 F 35 1 95370 academic/educator
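
The comma-separated FROM clause above relies on Hive's implicit join syntax. The same result can also be written with an explicit JOIN ... ON, which keeps the join condition out of the WHERE clause:

select u.*, o.occupation from users u join occupations o on (u.occupation_id = o.id) limit 10;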


Use Case 2:
Find out the number of non-adult users (under 18, as per the Indian standard) who have rated movies:

Solution: select count(*) from users where age < 18;
222
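
If you also want that count split by gender, a small variation on the same query (output not shown here):

select gender, count(*) from users where age < 18 group by gender;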

Use Case 3:
Find out the number of users in each occupation who are aged 25 and above, along with the occupation details:

Solution:
select o.occupation, count(1) from users u join occupations o on (u.occupation_id = o.id) where u.age > 24 group by o.occupation;

K-12 student 3
academic/educator 479
artist 220
clerical/admin 155
college/grad student 222
customer service 94
doctor/health care 227
executive/managerial 660
farmer 15
homemaker 86
lawyer 121
other/not specified 578
programmer 328
retired 141
sales/marketing 263
scientist 130
self-employed 223
technician/engineer 448
tradesman/craftsman 60
unemployed 30
writer 232


Use Case 4: Find the age of the user who has rated the most movies, along with their rating count:

Solution:

select u.userid, u.age, x.rating_count from users u join ( select r.userid, count(r.rating) as rating_count from ratings r group by r.userid order by rating_count desc limit 1 ) x on (u.userid = x.userid);

4169 50 2314
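
To see how this works, you can run the inner subquery on its own; it should return a single row for the most active rater (userid 4169 with 2314 ratings, as in the output above), and the outer join then just attaches that user's age:

select r.userid, count(r.rating) as rating_count from ratings r group by r.userid order by rating_count desc limit 1;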



---------------------
I am still working on this post and will add more examples here.