set up sample data set
if the data set up is done now let's do the hive stuff:
1. create a separate database named movielens
create database movielens;
use movielens;
2. create tables to hold data
CREATE EXTERNAL TABLE ratings (
userid INT,
movieid INT,
rating INT,
tstamp STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '#'
STORED AS TEXTFILE
LOCATION '/hive/data/rating';
CREATE EXTERNAL TABLE movies (
movieid INT,
title STRING,
genres ARRAY<STRING>
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '#'
COLLECTION ITEMS TERMINATED BY "|"
STORED AS TEXTFILE
LOCATION '/hive/data/movie';
CREATE EXTERNAL TABLE users (
userid INT,
gender STRING,
age INT,
occupation_id INT,
zipcode STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '#'
STORED AS TEXTFILE
LOCATION '/hive/data/user';
CREATE EXTERNAL TABLE occupations (
id INT,
occupation STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '#'
STORED AS TEXTFILE
LOCATION '/hive/data/occupation';
3. see if data is loaded
hive> select * from users limit 2;
OK
1 F 1 10 48067
2 M 56 16 70072
Time taken: 0.278 seconds, Fetched: 2 row(s)
hive> select * from movies limit 2;
OK
1 Toy Story (1995) ["Animation","Children's","Comedy"]
2 Jumanji (1995) ["Adventure","Children's","Fantasy"]
Time taken: 0.352 seconds, Fetched: 2 row(s)
hive> select * from ratings limit 2;
OK
1 1193 5 978300760
1 661 3 978302109
Time taken: 0.28 seconds, Fetched: 2 row(s)
hive> select * from occupations limit 2;
OK
0 other/not specified
1 academic/educator
Time taken: 0.245 seconds, Fetched: 2 row(s)
if you are all good till here than lets practice hiveQL stuffs.
NOTE: in each case to maintain readabilty I will limit the output to 10 only.
Use Case 1:
Find out Occupation of all the users:
Solution:
select u.*, o.occupation from users u, occupations o where u.occupation_id= o.id limit 10;
OUTPUT:
1 F 1 10 48067 K-12 student
2 M 56 16 70072 self-employed
3 M 25 15 55117 scientist
4 M 45 7 02460 executive/managerial
5 M 25 20 55455 writer
6 F 50 9 55117 homemaker
7 M 35 1 06810 academic/educator
8 M 25 12 11413 programmer
9 M 25 17 61614 technician/engineer
10 F 35 1 95370 academic/educator
Use Case 2:
Find out numbers of non-adults as per Indian standard, who has rated movies:
Solution: select count(*) from users where age < 18;
222
Use case 3:
Find out the no of users with same occupation and having age more than 25 along with occupation details:
Solution:
select o.occupation, count(1) from users u join occupations o where u.occupation_id= o.id AND u.age > 24 group by o.occupation;
K-12 student 3
academic/educator 479
artist 220
clerical/admin 155
college/grad student 222
customer service 94
doctor/health care 227
executive/managerial 660
farmer 15
homemaker 86
lawyer 121
other/not specified 578
programmer 328
retired 141
sales/marketing 263
scientist 130
self-employed 223
technician/engineer 448
tradesman/craftsman 60
unemployed 30
writer 232
Use Case 4: Find the age of the most rated user with counts of rating;
Solution:
select u.userid, u.age, x.count from users u join ( select r.userid, count(rating) count from ratings r group by (r.userid) order by count DESC limit 1) x where u.userid = x.userid;
4169 50 2314
---------------------
I am working on it, will add more examples here.
if the data set up is done now let's do the hive stuff:
1. create a separate database named movielens
create database movielens;
use movielens;
2. create tables to hold data
CREATE EXTERNAL TABLE ratings (
userid INT,
movieid INT,
rating INT,
tstamp STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '#'
STORED AS TEXTFILE
LOCATION '/hive/data/rating';
CREATE EXTERNAL TABLE movies (
movieid INT,
title STRING,
genres ARRAY<STRING>
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '#'
COLLECTION ITEMS TERMINATED BY "|"
STORED AS TEXTFILE
LOCATION '/hive/data/movie';
CREATE EXTERNAL TABLE users (
userid INT,
gender STRING,
age INT,
occupation_id INT,
zipcode STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '#'
STORED AS TEXTFILE
LOCATION '/hive/data/user';
CREATE EXTERNAL TABLE occupations (
id INT,
occupation STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '#'
STORED AS TEXTFILE
LOCATION '/hive/data/occupation';
3. see if data is loaded
hive> select * from users limit 2;
OK
1 F 1 10 48067
2 M 56 16 70072
Time taken: 0.278 seconds, Fetched: 2 row(s)
hive> select * from movies limit 2;
OK
1 Toy Story (1995) ["Animation","Children's","Comedy"]
2 Jumanji (1995) ["Adventure","Children's","Fantasy"]
Time taken: 0.352 seconds, Fetched: 2 row(s)
hive> select * from ratings limit 2;
OK
1 1193 5 978300760
1 661 3 978302109
Time taken: 0.28 seconds, Fetched: 2 row(s)
hive> select * from occupations limit 2;
OK
0 other/not specified
1 academic/educator
Time taken: 0.245 seconds, Fetched: 2 row(s)
if you are all good till here than lets practice hiveQL stuffs.
NOTE: in each case to maintain readabilty I will limit the output to 10 only.
Use Case 1:
Find out Occupation of all the users:
Solution:
select u.*, o.occupation from users u, occupations o where u.occupation_id= o.id limit 10;
OUTPUT:
1 F 1 10 48067 K-12 student
2 M 56 16 70072 self-employed
3 M 25 15 55117 scientist
4 M 45 7 02460 executive/managerial
5 M 25 20 55455 writer
6 F 50 9 55117 homemaker
7 M 35 1 06810 academic/educator
8 M 25 12 11413 programmer
9 M 25 17 61614 technician/engineer
10 F 35 1 95370 academic/educator
Use Case 2:
Find out numbers of non-adults as per Indian standard, who has rated movies:
Solution: select count(*) from users where age < 18;
222
Use case 3:
Find out the no of users with same occupation and having age more than 25 along with occupation details:
Solution:
select o.occupation, count(1) from users u join occupations o where u.occupation_id= o.id AND u.age > 24 group by o.occupation;
K-12 student 3
academic/educator 479
artist 220
clerical/admin 155
college/grad student 222
customer service 94
doctor/health care 227
executive/managerial 660
farmer 15
homemaker 86
lawyer 121
other/not specified 578
programmer 328
retired 141
sales/marketing 263
scientist 130
self-employed 223
technician/engineer 448
tradesman/craftsman 60
unemployed 30
writer 232
Use Case 4: Find the age of the most rated user with counts of rating;
Solution:
select u.userid, u.age, x.count from users u join ( select r.userid, count(rating) count from ratings r group by (r.userid) order by count DESC limit 1) x where u.userid = x.userid;
4169 50 2314
---------------------
I am working on it, will add more examples here.
4 comments:
You actually make it look so easy with your performance but I find this matter to be actually something which I think I would never comprehend. It seems too complicated and extremely broad for me. I'm looking forward for your next post, I’ll try to get the hang of it!
360digiTMG Best Data Science Institute
Thanks for such very great information..
home automation services near me
Thanks for such very great information..
Home Remodeling contractors around South Lyon
Thanks for such very great information..
https://diyaconstructiongroup.com/
Post a Comment