Gain a hands-on understanding of Google BigQuery and SQL.
Due: February 25, 2022 4:59:59 PM
Welcome to CSEE4121! This assignment will help you get familiar with the Google BigQuery platform and the SQL language. Google has published many datasets on BigQuery – these range from StackOverflow statistics to real-time air quality data. In this first assignment, you will be using BigQuery’s SQL interface to answer questions about the Stackoverflow Dataset.
After completing this programming assignment, students should be able to use BigQuery's SQL interface to translate plain-English questions into SQL queries over a real dataset.
This section will guide you through creating a BigQuery project and setting up your account so that you can query the stackoverflow dataset.
Make sure you have followed the instructions provided by CRF to redeem your credits in Google Cloud.
Click this link, then click "Create" to make a GCP (Google Cloud Platform) project.
Fill in the information to make a new project. Make sure to select the correct billing account for it: the one you obtained from redeeming your GCP credits for the class.
In case you chose the wrong billing account while creating the project, you can change the billing account later as well:
If you only have one billing account (obtained from redeeming your credits), then it will automatically be linked with the newly created project (in Step 3).
But if you have multiple billing accounts, then it is important to link the project with the right one. Please follow the instructions mentioned to change the billing account for your project to the one created in Step 1.
Go to this link, which is the page for BigQuery's public datasets. In the top-right corner of the bottom panel, click the PIN PROJECT button.
Now when you visit the console, you'll be able to easily find the datasets in your sidebar. Scroll down in the bigquery-public-data project to find the stackoverflow dataset.
Click on the dataset, and you should see 16 tables in the dataset. Click on each table, and you should see the schema and descriptions of each table.
Although we will be dealing with only a subset of these tables, we recommend getting familiar with all of them before you jump into writing queries. We intend for this assignment to be about translating a question in plain English into a query over a schema; in other words, we want you to read the tables, explore the data, and think about which tables and columns are needed to answer the question we're asking. This is how real-world data querying and analysis works!
Click on QUERY TABLE on the bottom panel to start a query on a table.
Once you've typed in your query, you should notice a green validator on the bottom right corner of the top panel. It gives you an estimate of how much data will be processed to run the query. Note: you will be billed by the number of bytes read by each query, and each month only the first 1TB is free, so be careful and try to avoid using SELECT *.
In declarative languages, it’s easier to build up the query piece by piece. Start with a basic frame of what you’re looking for (maybe write the conditions, or do a join). Then add complexity to your query one bit at a time. It’s much easier to debug this way as well.
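For example, here is a minimal sketch of that workflow against the users table (the filter values are arbitrary and only for illustration):

```sql
-- Step 1: start small. Select only the columns you need (also cheaper than SELECT *).
SELECT
  display_name,
  reputation
FROM `bigquery-public-data.stackoverflow.users`
LIMIT 10;

-- Step 2: add conditions and ordering one piece at a time, re-running as you go.
SELECT
  display_name,
  reputation
FROM `bigquery-public-data.stackoverflow.users`
WHERE reputation > 1000
ORDER BY reputation DESC
LIMIT 10;
```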
BigQuery can auto-format your SQL queries with CTRL-SHIFT-F on Windows or CMD-SHIFT-F on Mac (you might need to do some editing for it to take effect). This is a nice way to learn conventional SQL style guidelines (and will also make your queries more readable, which we appreciate).
Note: When querying in BigQuery, table names should be wrapped in backticks (`). For example, instead of saying:
SELECT * FROM bigquery-public-data.stackoverflow.users
say:
SELECT * FROM `bigquery-public-data.stackoverflow.users`
Please read the submission instructions carefully before starting the assignment.
Now that you’ve gotten comfortable and familiar with BigQuery and its SQL querying interface, let’s get to work and answer some questions about the Stackoverflow dataset!
You will be asked to submit a CSV file for each question; you can download the results of a query using the SAVE RESULTS button on the bottom panel and choose to save the CSV locally. We will provide the column names for the expected output. You can name your columns whatever you want, as long as the content matches in both ordering and values.
Notes:
For this assignment, you will only need the following four tables: badges, posts_answers, posts_questions, and users. That is, you should ignore all the other tables present in the stackoverflow dataset and use only the tables mentioned above to answer the questions below.

(1 point) List the top 10 users with the highest reputation, with upvotes greater than 10000 and downvotes less than 300. Order by reputation in descending order.
display_name | reputation | up_votes | down_votes |
---|---|---|---|
(1 point) Distribution of users geographically. List the top 10 locations with the most users, along with their counts, in descending order of count.
location | count |
---|---|
As you may have noticed already, there are many null values in the dataset. Also, the location information is not properly formatted. In the coming questions we take these issues into account.
(2 points) Distribution of users with respect to countries. List the top 10 countries with the most users along with their count. Order them by count in descending order.
Since the location data is very messy, we will take the following actions for some of the countries. If the location contains (case sensitive):
Omit all NULL locations from the results. For other locations (not mentioned in the above list), please keep the names as they are.
NOTE: Only use SELECT statements. No UPDATE statements are required.
country | num_users |
---|---|
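As a sketch of how such cleanup can be done inside a SELECT (the substrings below are placeholders for illustration, standing in for the actual list above):

```sql
SELECT
  CASE
    -- Placeholder matches: substitute the actual substrings from the list above.
    WHEN location LIKE '%India%' THEN 'India'
    WHEN location LIKE '%United States%' THEN 'United States'
    ELSE location  -- keep other locations as they are
  END AS country
FROM `bigquery-public-data.stackoverflow.users`
WHERE location IS NOT NULL  -- omit NULL locations
LIMIT 10;
```

Note that LIKE in BigQuery is case-sensitive, which matches the requirement above.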
(1 point) We now focus on user activity on Stackoverflow. List the number of active users every year (from when Stackoverflow was founded).
Order by years in descending order.
Active users with respect to a year are users whose last_access_date was in that year. It does not matter whether the user came online in previous years or not. For example, if a user's last_access_date was in the year 2020, then the user will only be considered an active user for the year 2020.
last_access_year | num_users |
---|---|
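One way (not the only one) to get the year out of a TIMESTAMP column is EXTRACT; a tiny self-contained check:

```sql
-- EXTRACT(YEAR FROM ...) works directly on a TIMESTAMP value.
SELECT EXTRACT(YEAR FROM TIMESTAMP '2020-06-15 12:00:00') AS last_access_year;  -- returns 2020
```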
(2 points) Some users are dormant while some are active. We would like to see examples of users who are dormant.
Considering today is December 30, 2021, list the users (id, display_name, last_access_date) along with days_since_last_access and days_since_creation.
List the top 10 users meeting the above requirements, ordered by days_since_last_access (descending), then by days_since_creation (descending).
HINT: The DATE_DIFF() function lets you compute the difference between two dates (you may need to cast fields of TIMESTAMP type first).
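A small self-contained example of the hint (the literal dates are arbitrary):

```sql
-- DATE_DIFF compares two DATE values; CAST TIMESTAMP columns to DATE first.
SELECT DATE_DIFF(DATE '2021-12-30',
                 CAST(TIMESTAMP '2020-01-15 08:30:00' AS DATE),
                 DAY) AS days_between;  -- returns 715
```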
id | display_name | last_access_date | days_since_last_access | days_since_creation |
---|---|---|---|---|
(2 points) How does a user's reputation change with upvotes and downvotes? Since users' reputation varies widely, we will bucketize the users into the following reputation buckets: 0-500, 501-5000, 5001-50000, 50001-500000, >500000.
These are reputation values converted to intervals. Ensure that the above intervals are strings in your final output for the reputation_bucket column.
For every reputation bucket, do the following:
upvote_ratio: (total upvotes)/(total downvotes)
This is the ratio of the sum of upvotes to the sum of downvotes for the users that lie in that bucket.
Order by num_users in descending order.
NOTE: Use the exact reputation bucket values mentioned above. Do not change the interval values.
HINT: Use CASE ... WHEN operators for bucketizing, and the ROUND function for rounding floating-point values to the appropriate number of decimal places.
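A minimal sketch of how the hint's pieces fit together (only two buckets shown, and rounding to 2 decimal places is an assumption; adjust both to match the requirements above):

```sql
SELECT
  CASE
    WHEN reputation <= 500 THEN '0-500'
    WHEN reputation <= 5000 THEN '501-5000'
    ELSE '>5000'  -- the remaining buckets follow the same pattern
  END AS reputation_bucket,
  -- NULLIF guards against division by zero if a bucket has no downvotes.
  ROUND(SUM(up_votes) / NULLIF(SUM(down_votes), 0), 2) AS upvote_ratio,
  COUNT(*) AS num_users
FROM `bigquery-public-data.stackoverflow.users`
GROUP BY reputation_bucket
ORDER BY num_users DESC;
```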
reputation_bucket | upvote_ratio | num_users |
---|---|---|
(2 points) What were the most popular tags in the year 2020 on stack overflow based on their occurrence? List the top 10 tags along with their count.
Order by the count in descending order.
Clarification: You can get the overall distribution of tags with their counts from the tags table, but it would be difficult to obtain counts for the year 2020 that way. The suggestion is to use the tags attribute in posts_questions instead.
Note: The year corresponds to the year when the questions were created.
HINT: The SPLIT function splits a string on a delimiter, and the UNNEST function returns a table from an array given as input.
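For instance, a sketch of how the two combine (tags in this dataset are stored as a single '|'-delimited string):

```sql
-- SPLIT turns 'python|pandas' into an array; UNNEST flattens that array into rows.
SELECT tag
FROM `bigquery-public-data.stackoverflow.posts_questions`,
     UNNEST(SPLIT(tags, '|')) AS tag
LIMIT 10;
```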
tag | count |
---|---|
We see that python and javascript were the most popular tags in the year 2020 and overall across all of stackoverflow as well.
(1 point) Badge Analysis: Stackoverflow rewards users with various kinds of achievement badges based on different milestones (similar to gaming achievements). The badges are divided into three classes: Gold (1), Silver (2), and Bronze (3). More about them here.
Each badge class contains several named badges.
We would like to find out the most popular golden badges that stackoverflow awards.
List out the top 10 golden badges (name) along with the number of users who have received these badges in descending order of the number of users.
Clarification: A user can receive more than one golden badge of the same name. In that case, we will count that user twice for the said golden badge.
name | num_users |
---|---|
(2 points) What is the relationship between a user’s reputation, and badges earned?
List out the top 10 users who have the highest number of golden badges along with the number of golden badges that they have earned.
Order by the number of gold badges (desc).
id | display_name | reputation | up_votes | down_votes | num_gold_badges |
---|---|---|---|---|---|
(2 points) “Illuminator” is the rarest badge that is awarded by stackoverflow. Up to now, only 148 users have been bestowed with this badge.
We would like to find out how early you can obtain this badge. This time, list out the top 20 users (id, display_name, reputation) who took the lowest number of days from when they created their account to receive their first Illuminator badge, along with that number of days.
Order by the number of days in ascending order.
id | display_name | reputation | num_days |
---|---|---|---|
(1 point) You might remember that we did a comparison between reputation bucket and upvote_ratio. This question is pretty similar to that one – we would like to check a similar relationship between score and view counts for the questions asked on stackoverflow.
The score buckets: <0, 0-100, 101-1000, 1001-10000, >10000. These are score values converted to intervals.
List out the average number of views for each score bucket value, using the posts_questions table.
Round it to 2 decimal places and order by the average views in ascending order. HINT: Use the ROUND function provided by BigQuery.
score_bucket | avg_num_views |
---|---|
We can see the high correlation between the two. This seems pretty natural, but it is important to confirm some of our assumptions before going into further analysis.
(1 point) Which weekday had the most answers given?
List out the weekday (as a number 1-7) along with the total number of answers given on that day.
Order by the number of answers in descending order.
HINT: In BigQuery, DAYOFWEEK is a date part that lets you extract the weekday. The range is 1-7, with 1 corresponding to Sunday. Even though the column is named day_name, the values are numeric only.
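A tiny sanity check of the hint:

```sql
-- 1 = Sunday ... 7 = Saturday; 2021-12-29 was a Wednesday.
SELECT EXTRACT(DAYOFWEEK FROM TIMESTAMP '2021-12-29 10:00:00') AS day_name;  -- returns 4
```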
day_name | num_answers |
---|---|
Interestingly, Wednesday has the most answers! It’s not surprising that users do not answer much during weekends.
(1 point) What is the percentage of questions that are answered every year? Note that we attribute each question to the year in which it was posted: if a question was posted in year X and answered in year Y, we count it as a question of year X.
List out the year and the number of questions posted in that year, along with the percentage answered (multiply by 100, then round to 2 decimal places). A question is said to be answered if its answer_count field is greater than zero.
Order by year (ascending order).
Clarification: A question only belongs to the year in which it was created. For example, if a question had a creation year of 2020, then it will only belong in 2020.
HINT: Use the ROUND function available in BigQuery.
year | num_questions | percentage_answered |
---|---|---|
(2 points) List the top 20 users who posted the highest number of answers, along with their id, display_name, reputation, and number of answers.
Print only those users with more than 50 answers. Order by the number of answers in descending order.
In the posts_answers table: the attribute id refers to the id of the answer text, owner_user_id refers to the user who posted the answer, and parent_id refers to the id of the question that the answer is for.
Clarification: The answer_count and accepted_answer_id fields are all null for posts_answers.

id | display_name | reputation | num_answers |
---|---|---|---|
Gordon Linoff is on a roll!
(2 points) Follow-up to the previous question: we would now like to find the specialist users in python. A user is considered a python specialist if they have answered a large number of python-based questions on stackoverflow.
A python-based question is a question that has the substring "python" as one of its tags.
With all the conditions remaining the same as in the previous question, print out the top 20 python specialists along with the number of answers posted by them.
Note: Every answer in posts_answers has a parent_id attribute pointing to the question id (in posts_questions). Multiple answers can have the same parent_id.
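To make the relationship concrete, a minimal sketch of the answer-to-question join (not the full query):

```sql
-- Each answer row points at its parent question via parent_id.
SELECT
  a.owner_user_id,
  q.tags
FROM `bigquery-public-data.stackoverflow.posts_answers` AS a
JOIN `bigquery-public-data.stackoverflow.posts_questions` AS q
  ON a.parent_id = q.id
LIMIT 10;
```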
id | display_name | reputation | num_answers |
---|---|---|---|
(2 points) Analyzing posts with extreme scores. List the average number of answers, favourite count, and comment count for posts with score <0 and for posts with score >10000.
There should be two rows in the table corresponding to the two scores and their averages.
Round to 2 decimal places, and order by score.
score | avg_answers | avg_fav_count | avg_comments |
---|---|---|---|
<0 | | | |
>10000 | | | |
Note: Ensure that the output CSV has two rows, with the score column values as strings exactly as shown in the sample table (no spaces).
Stark difference between the attributes for the two scores! Confirms our assumptions, and these features become highly important later in the analysis.
Great Job!! This is the end of the assignment!
This is a solo project so every student should submit their own work.
Each student should submit one zip file to Gradescope.
Please ensure that your directory structure is as mentioned below. The zip file that you submit should have all the csv results as well as the hw1_submission.py file at the root.
If they are present in any other folder inside the zip file, the autograder will throw an error.
The structure should be like:
slj2142_assignment1.zip
├── hw1_submission.py
├── q1.csv
├── q2.csv
├── q3.csv
├── q4.csv
├── q5.csv
├── q6.csv
├── q7.csv
├── q8.csv
├── q9.csv
├── q10.csv
├── q11.csv
├── q12.csv
├── q13.csv
├── q14.csv
├── q15.csv
└── q16.csv
When you submit to Gradescope, we will run a syntax checker on your queries. It runs immediately and reports whether each query ran OK or had errors; please make sure that you get a positive result from this test in your final submission.
Handling Ties: You might notice that in case of ties (between two records on the ordering column), BigQuery may produce either ordering. Do not worry about this; we have configured the autograder to handle ties and grade accordingly.
You will not see a final grade until after the project deadline. It is your responsibility to ensure that your final submission is free from Python or SQL syntax errors and that you follow all instructions in this section. We reserve the right to deduct points from your project if you do not follow the submission instructions, or if you have syntax errors in your queries.
The autograder automatically checks for cheating. Students who are caught cheating will receive a grade of 0.