Network and Platform Analytics

1 Tutorial

Go through this notebook for a tutorial on R and igraph: Tutorial Notebook

2 Download data

For this assignment you will use data from Amazon, provided on Brightspace. The nodes in this network are Amazon products, including books, movies, and music. The edges in this network represent hyperlinks from a given product’s landing page to the landing pages of those products most frequently co-purchased with the given product.

When you unzip data.zip, you will have access to the following data files:

1. graph_complete.txt: The edges of the graph in the form from → to. Each line is an edge, with the origin node and destination node separated by a space. The data set includes 366,987 product nodes and 1,231,400 co-purchase edges.

2. graph_subset_rank1000.txt: A subset of the complete network, containing only products with salesrank under 1,000. Each line is an edge, with the two nodes separated by a space. The data set includes 1,355 product nodes and 2,611 co-purchase edges.

Note: Multiple products may share the same salesrank in our data, so there are more than 1,000 products with salesrank under 1,000.

3. graph_subset_rank1000_cc.txt: The largest connected component in the network of products with salesrank under 1,000. Each line is an edge, with the two nodes separated by a space. The data set includes 292 product nodes and 604 co-purchase edges.

4. id_to_titles.txt: Maps the integer ids (primary keys) used to identify nodes to the actual names of the products. There are two space-separated fields in this file: the integer id and the string title.
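
The file formats described above can be read with a few lines of code. The following is a minimal sketch in Python, assuming each edge line holds two integer ids separated by a space, and each title line holds an integer id followed by the title. Because a title may itself contain spaces, only the first space is treated as a separator. The helper names parse_edges and parse_titles and the sample lines are hypothetical and are not part of the provided data.

```python
def parse_edges(lines):
    """Parse 'from to' lines into a list of (source, target) integer pairs."""
    edges = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        edges.append((int(parts[0]), int(parts[1])))
    return edges


def parse_titles(lines):
    """Parse 'id title' lines into a dict, splitting only on the first space."""
    titles = {}
    for line in lines:
        node_id, _, title = line.strip().partition(" ")
        if not node_id.isdigit():
            continue  # skip blank lines and any non-numeric header row
        titles[int(node_id)] = title
    return titles


# Hypothetical sample lines standing in for the real files.
sample_edges = ["1 2", "1 4", "2 11", ""]
sample_titles = ["1 Example Book One", "2 Example Book Two: A Sequel"]

edges = parse_edges(sample_edges)
titles = parse_titles(sample_titles)
print(edges)
print(titles[2])
```

Both helpers accept any iterable of lines, so an open file handle can be passed in place of the sample lists.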

The raw data are available from the Stanford Network Analysis Project (http://snap.stanford.edu/data/amazon-meta.html) and were collected in summer 2006. The original dataset contains 548,552 records of books, movies, and music sold on Amazon.com, along with product categories, reviews, and information on co-purchased products. We cleaned and filtered the data as follows:

1. graph_complete.txt: We removed discontinued products, and removed edges involving products for which no metadata was available. That is, we kept only products that had a co-purchase link to another product in the dataset.

2. graph_subset_rank1000.txt: In addition to the above, we kept only products that had a salesrank between 0 and 1,000, and kept only co-purchase links between items in this reduced set of products.

3. graph_subset_rank1000_cc.txt: In addition to the above, we kept only the largest connected component from this graph.
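
The last filtering step, keeping only the largest connected component, can be illustrated as follows. This is a sketch and not the code used to prepare the data. It assumes edges are treated as undirected when deciding connectivity (weak connectivity), and the function name largest_component and the toy edges are hypothetical.

```python
from collections import defaultdict, deque


def largest_component(edges):
    """Return the node set of the largest connected component.

    Edges are treated as undirected (an assumption), so this finds
    weakly connected components of a directed edge list.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    best = set()
    for start in list(neighbors):
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from start.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in component:
                    component.add(nxt)
                    queue.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Toy edge list: one triangle and one separate pair.
toy_edges = [(1, 2), (2, 3), (3, 1), (4, 5)]
keep = largest_component(toy_edges)
# Keep only edges whose endpoints both lie in the largest component.
cc_edges = [(a, b) for a, b in toy_edges if a in keep and b in keep]
print(sorted(keep))
print(cc_edges)
```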

3 Network structure visualization

Install the igraph package for R (suggested) or Python. To do so, in R type:

    # Download and install the package
    install.packages("igraph")

1. Plot the network using the information in the file graph_subset_rank1000.txt. Note that this is not the complete network, but only a subset of edges between top-ranked products. By visualizing the graph, you get an idea of the structure of the network you will be working on. In addition to plotting, comment on anything interesting you observe.

Hints:

Refer to https://kateto.net/netscix2016.html for a tutorial on igraph and R basics.
You may find it useful to treat this data file as being in ncol format in igraph.
It may be simplest to treat the network as undirected for the purposes of visualization (since directed arrows can add a lot of visual clutter in a graph of this size).
Playing with the size, color, and layout of objects may make the network easier to visualize. When plotting you can start with layout=layout.auto and then experiment with other options. layout=layout.kamada.kawai generally gives good results.

2. Now, use the file graph_subset_rank1000_cc.txt to plot only the largest connected component in the above network. You should be able to reuse your code from above on the new data.
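
The hint about treating the network as undirected for plotting can be made concrete with a small preprocessing step. The sketch below, in Python with a hypothetical helper name to_undirected and toy edges, collapses a → b and b → a into a single undirected edge and drops self-loops, so that a plotting library draws each connection once. It does not do the plotting itself, and dropping self-loops is an assumption made here to reduce clutter.

```python
def to_undirected(edges):
    """Collapse directed (a, b) pairs into unique undirected edges."""
    unique = set()
    for a, b in edges:
        if a == b:
            continue  # drop self-loops, which only add clutter to the plot
        # Store each pair in a fixed order so (a, b) and (b, a) coincide.
        unique.add((min(a, b), max(a, b)))
    return sorted(unique)


# Toy directed edges: a reciprocal pair, a one-way link, and a self-loop.
directed = [(1, 2), (2, 1), (2, 3), (3, 3)]
undirected = to_undirected(directed)
print(undirected)  # [(1, 2), (2, 3)]
```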

4 Data analysis

For the rest of the assignment, use the complete graph contained in the file graph_complete.txt and the title file id_to_titles.txt. It will be in your best interest to use a programming language such as R or Python. If you face computational challenges analyzing the larger data set graph_complete.txt, you may contact your TA for permission to use the data set graph_subset_rank1000.txt instead, with a brief explanation of the barriers you faced using the larger data set.

Note: Here, we are working with a directed graph. For example, the “Grapes of Wrath” product page might highlight a co-purchase link to “East of Eden”, but the “East of Eden” product page might not necessarily link back to the “Grapes of Wrath” product page, and might instead link to “The Winter of Our Discontent”. Each product can have multiple inbound or outbound edges.
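
To make the direction of the edges concrete, here is a small Python sketch built on the example in the note. The variable names successors and predecessors are illustrative choices, and the two edges are exactly the ones mentioned in the note, not real rows from the data.

```python
from collections import defaultdict

# Toy directed edges based on the example in the note.
edges = [
    ("Grapes of Wrath", "East of Eden"),
    ("East of Eden", "The Winter of Our Discontent"),
]

successors = defaultdict(set)    # a -> set of b such that a links to b
predecessors = defaultdict(set)  # b -> set of a such that a links to b
for a, b in edges:
    successors[a].add(b)
    predecessors[b].add(a)

# A link in one direction does not imply a link back.
print("East of Eden" in successors["Grapes of Wrath"])  # True
print("Grapes of Wrath" in successors["East of Eden"])  # False

# Out-degree and in-degree of a single product.
out_degree = len(successors["East of Eden"])
in_degree = len(predecessors["East of Eden"])
print(out_degree, in_degree)  # 1 1
```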

1. Plot the out-degree distribution of our dataset (x-axis number of similar products, y-axis number of nodes). That is, for each product a, count the number of outgoing links to another product page b such that a → b.

Hint: The following steps will outline one way to approach this problem.

(a) Start by calculating the out-degree for each product. You may use the table command in R or a dict in Python to compute the number of outbound links for each product.

(b) You can then apply the same process you just used to count the number of products (nodes) that have a particular number of outgoing links. This is the out-degree distribution.

(c) Once you are done, you can use the default plotting environment in R, ggplot2 in R, or matplotlib in Python to plot the distribution. Note that you can avoid step (b) if you use the geom_density() function in ggplot2 or the hist() method in matplotlib. However, you may approach this any way you wish.

2. Above, you should have found that each product contains a maximum of five outbound links to similar products in the dataset. Now, plot the in-degree distribution of our dataset (x-axis number of similar products, y-axis number of nodes). That is, for each product a, count the number of incoming links from another product page b such that b → a. You can use the same steps outlined above. Is the distribution different? Comment on what you observe.

3. Transform the x-axis of the previous graph to log scale, to get a better understanding of the distribution. Note that some products should have 0 inbound links, so taking the log of the x-axis will fail because log(0) is undefined. To work around this, replace 0 with 0.1 before transforming. Comment on what you observe.

4. Compute the average number of inbound co-purchase links, the standard deviation, and the maximum. Comment on the result.

5. Report the names of the 10 products with the most inbound co-purchase links.
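
The steps in the questions above can be sketched with the Python standard library. This is one possible approach, not a reference solution. The function names, the toy edges, and the toy titles are hypothetical. Using the population standard deviation (statistics.pstdev) rather than the sample version is also an assumption. Plotting is left out, since the resulting dictionaries can be passed to any plotting tool.

```python
import math
import statistics
from collections import Counter


def degree_counts(edges):
    """Return (out_degree, in_degree) dicts covering every node seen."""
    out_deg = Counter(a for a, _ in edges)
    in_deg = Counter(b for _, b in edges)
    nodes = set(out_deg) | set(in_deg)
    # A Counter returns 0 for missing keys; store those zeros explicitly.
    return ({n: out_deg[n] for n in nodes}, {n: in_deg[n] for n in nodes})


def distribution(degrees):
    """Map each degree value to the number of nodes having that degree."""
    return dict(sorted(Counter(degrees.values()).items()))


def log_x(dist, zero_substitute=0.1):
    """Return (log10(degree), count) pairs, replacing degree 0 first."""
    return [
        (math.log10(d if d > 0 else zero_substitute), count)
        for d, count in dist.items()
    ]


# Hypothetical toy data standing in for the real edge and title files.
edges = [(1, 2), (1, 3), (2, 3), (4, 3), (4, 1)]
titles = {1: "Example A", 2: "Example B", 3: "Example C", 4: "Example D"}

out_deg, in_deg = degree_counts(edges)
out_dist = distribution(out_deg)  # degree -> number of nodes
in_dist = distribution(in_deg)
in_dist_log = log_x(in_dist)

# Summary statistics of the in-degree over all nodes.
values = list(in_deg.values())
mean_in = statistics.mean(values)
std_in = statistics.pstdev(values)  # population standard deviation
max_in = max(values)

# Ten nodes with the most inbound links (ties broken by node id).
top = sorted(in_deg.items(), key=lambda kv: (-kv[1], kv[0]))[:10]
top_titles = [(titles.get(node, str(node)), deg) for node, deg in top]

print(out_dist)
print(in_dist)
print(mean_in, max_in)
print(top_titles[0])
```

Nodes that appear only as a source or only as a destination are included with a degree of 0, which is what makes the substitution of 0.1 necessary before taking the logarithm.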
