Network and Platform Analytics: Amazon

Mar 26, 2024

Introduction

Welcome to our blog post, where we walk through a sample project requirement on network and platform analytics for e-commerce. In this demonstration, we outline a solution approach that shows how the co-purchase relationships between Amazon products can be analyzed as a network to support data-driven decision-making.

Project Requirement

Problem Statement:

Understanding the network structure and dynamics of Amazon's product ecosystem is crucial for various analytical tasks such as recommendation systems, market analysis, and understanding consumer behavior. In this context, the problem arises in analyzing the network and platform analytics of Amazon products, focusing on the co-purchase relationships between them. This involves exploring the network topology, distribution of links, and characteristics of inbound and outbound connections within the product network.

Objective:

The objective of this project is to conduct a comprehensive analysis of Amazon's product network using provided data sets, focusing on the following aspects:

Download data

For this assignment you will use data from Amazon, provided on Brightspace. The nodes in this network are Amazon products, including books, movies, and music. The edges in this network represent hyperlinks from a given product’s landing page to the landing pages of those products most frequently co-purchased with the given product.

When you unzip data.zip, you will have access to the following data files:

1. graph_complete.txt: The edges of the graph in the form from → to. Each line is an edge, with the origin node and destination node separated by a space. The data set includes 366,987 product nodes and 1,231,400 co-purchase edges.

2. graph_subset_rank1000.txt: A subset of the complete network, containing only products with salesrank under 1,000. Each line is an edge where each node is separated by a space. The data set includes 1,355 product nodes and 2,611 co-purchase edges.

Note: Multiple products may share the same salesrank in our data, so there are more than 1,000 products with salesrank under 1,000.

3. graph_subset_rank1000_cc.txt: The largest connected component in the network of products with salesrank under 1,000. Each line is an edge where each node is separated by a space. The data set includes 292 product nodes and 604 co-purchase edges.

4. id_to_titles.txt: Maps the integer ids (primary keys) used to identify nodes to the actual names of the products. There are two space-separated fields in this file: the integer id and the string title.
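The file formats described above can be read with a few lines of Python. This is a minimal sketch, not the original solution: the function names parse_edges and parse_titles are hypothetical, and it assumes the files have no header row and that a title may itself contain spaces (so only the first space separates the id from the title).

```python
from typing import Dict, Iterable, List, Tuple


def parse_edges(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Parse 'from to' lines into (origin, destination) integer pairs."""
    edges = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        edges.append((int(parts[0]), int(parts[1])))
    return edges


def parse_titles(lines: Iterable[str]) -> Dict[int, str]:
    """Parse 'id title' lines; everything after the first space is the title."""
    titles = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        node_id, _, title = line.partition(" ")
        titles[int(node_id)] = title.strip()
    return titles


# Tiny in-memory sample (made-up ids and titles, not taken from the data set).
sample_edges = parse_edges(["1 2", "1 3", "2 3", ""])
sample_titles = parse_titles(["1 The Grapes of Wrath", "2 East of Eden"])
```

With the real data, an open file object can be passed directly, for example parse_edges(open("graph_complete.txt")), since both functions accept any iterable of lines.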

The raw data are available from the Stanford Network Analysis Project (http://snap.stanford.edu/data/amazon-meta.html) and were collected in summer 2006. The original dataset contains 548,552 records of books, movies, and music sold on Amazon.com, along with product categories, reviews, and information on co-purchased products. We cleaned and filtered the data as follows:

1. graph_complete.txt: We removed discontinued products, and removed edges involving products for which no metadata was available. That is, we kept only products that had a co-purchase link to another product in the dataset.

2. graph_subset_rank1000.txt: In addition to the above, we kept only products that had a salesrank between 0 and 1,000, and kept only co-purchase links between items in this reduced set of products.

3. graph_subset_rank1000_cc.txt: In addition to the above, we kept only the largest connected component from this graph.
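To illustrate the last filtering step, here is a rough sketch of how a largest connected component could be extracted from an edge list. It is not the procedure used to build the provided files; it assumes that "connected" means weakly connected (edge direction is ignored), and the function name largest_component_edges is hypothetical.

```python
from collections import defaultdict, deque


def largest_component_edges(edges):
    """Return the edges that lie inside the largest weakly connected component."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)  # ignore direction when testing connectivity

    seen = set()
    best = set()
    for start in neighbors:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from start.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in component:
                    component.add(nxt)
                    queue.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component

    # Both endpoints of an edge share a component, so checking one is enough.
    return [(a, b) for a, b in edges if a in best]


# Made-up sample: a triangle (1, 2, 3) and a separate pair (4, 5).
sample = [(1, 2), (2, 3), (3, 1), (4, 5)]
largest = largest_component_edges(sample)
```

In this sample the triangle is kept and the isolated pair is dropped, which mirrors how the cc file relates to the rank-1000 subset.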

Network structure visualization

Install the igraph package for R (suggested) or Python. To do so, in R type:

# Download and install the package
install.packages("igraph")

Plot the network using the information in the file graph_subset_rank1000.txt. Note that this is not the complete network, but only a subset of edges between top-ranked products. By visualizing the graph, you get an idea of the structure of the network you will be working on. In addition to plotting, comment on anything interesting you observe.

Hints:

• Refer to https://kateto.net/netscix2016.html for a tutorial on igraph and R basics.
• You may find it useful to treat this data file as being in ncol format in igraph.
• It may be simplest to treat the network as undirected for the purposes of visualization (since directed arrows can add a lot of visual clutter in a graph of this size).
• Playing with the size, color, and layout of objects may make the network easier to visualize. When plotting you can start with layout=layout.auto and then experiment with other options. layout=layout.kamada.kawai generally gives good results.

Now, use the file graph_subset_rank1000_cc.txt to plot only the largest connected component in the above network. You should be able to reuse your code from above on the new data.
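The hint about treating the network as undirected for plotting can be sketched in plain Python as follows. This is only an illustration of the idea, with a hypothetical helper name to_undirected; a graph library such as igraph can typically do the same when the file is read as an undirected graph.

```python
def to_undirected(edges):
    """Collapse directed pairs into unique undirected pairs, dropping self-loops."""
    undirected = set()
    for a, b in edges:
        if a != b:
            # Store each pair with the smaller id first so a->b and b->a coincide.
            undirected.add((min(a, b), max(a, b)))
    return sorted(undirected)


# Made-up sample: a reciprocal link, a one-way link, and a self-loop.
directed_sample = [(1, 2), (2, 1), (2, 3), (3, 3)]
undirected_sample = to_undirected(directed_sample)
```

Reciprocal links are drawn once instead of twice, which is the main source of the visual clutter the hint mentions.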

Data analysis

For the rest of the assignment, use the complete graph contained in the file graph_complete.txt and the title file id_to_titles.txt. It will be in your best interest to use a programming language such as R or Python.

If you face computational challenges analyzing the larger data set graph_complete.txt, you may contact your TA for permission to use the data set graph_subset_rank1000.txt instead, with a brief explanation about what barriers you faced using the big data option.

Note: Here, we are working with a directed graph. For example, the “Grapes of Wrath” product page might highlight a co-purchase link to “East of Eden”, but the “East of Eden” product page might not necessarily link back to the “Grapes of Wrath” product page, and might instead link to “The Winter of Our Discontent”. Each product can have multiple inbound or outbound edges.

See http://igraph.org/r/ or http://igraph.org/python/ for more information on igraph for R and Python.

1. Plot the out-degree distribution of our dataset (x-axis number of similar products, y-axis number of nodes). That is, for each product a, count the number of outgoing links to another product page b such that a → b.

Hint: The following steps will outline one way to approach this problem.

(a) Start by calculating the out-degree for each product. You may use the table command in R or a dict in Python to compute the number of outbound links for each product.

(b) You can then apply the same process you just used so that you can count the number of products (nodes) that have a particular number of outgoing links. This is the out-degree distribution.

(c) Once you are done, you can use the default plotting environment in R, ggplot2 in R, or matplotlib in Python to plot the distribution. Note that you can avoid step (b) if you use the geom_density() function in ggplot or the hist() method in matplotlib. However, you may approach this any way you wish.

2. Above, you should have found that each product contains a maximum of five outbound links to similar products in the dataset. Now, plot the in-degree distribution of our dataset (x-axis number of similar products, y-axis number of nodes). That is, for each product a, count the number of incoming links from another product page b such that b → a. You can use the same steps outlined above. Is the distribution different? Comment on what you observe.

3. Transform the x-axis of the previous graph to log scale, to get a better understanding of the distribution. Note here that you should have some products with 0 inbound links. This means that using the log of the x-axis will fail since log(0) will not be valid. Due to this, you should replace 0 with 0.1. Comment on what you observe.

4. Compute the average number of inbound co-purchase links, the standard deviation, and the maximum. Comment on the result.

5. Report the names of the 10 products with the most inbound co-purchase links.
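The five tasks above can be sketched with the Python standard library alone. This is one possible approach under stated assumptions, not the original solution: all function names are hypothetical, the edge list is assumed to contain no duplicate edges, the log transform is assumed to be base 10, and the standard deviation is the sample version (the same convention as R's sd()). Plotting is left out; the resulting (degree, count) pairs can be passed to any plotting tool.

```python
import math
import statistics
from collections import Counter


def degree_counts(edges):
    """Return (out_degree, in_degree) Counters keyed by product id."""
    out_degree = Counter(origin for origin, _ in edges)
    in_degree = Counter(destination for _, destination in edges)
    # Every product seen anywhere gets an entry, so zero degrees are counted too.
    for origin, destination in edges:
        out_degree.setdefault(destination, 0)
        in_degree.setdefault(origin, 0)
    return out_degree, in_degree


def distribution(degree_by_node):
    """Tasks 1 and 2: map each degree value to the number of products with it."""
    return Counter(degree_by_node.values())


def log_x_points(dist, zero_substitute=0.1):
    """Task 3: (log10(degree), count) pairs, with degree 0 replaced by 0.1."""
    return [
        (math.log10(degree if degree > 0 else zero_substitute), count)
        for degree, count in sorted(dist.items())
    ]


def summary(degree_by_node):
    """Task 4: mean, sample standard deviation, and maximum of the degrees."""
    values = list(degree_by_node.values())
    return statistics.mean(values), statistics.stdev(values), max(values)


def top_inbound(in_degree, titles, n=10):
    """Task 5: titles of the n products with the most inbound links."""
    return [
        (titles.get(node, f"<no title for id {node}>"), count)
        for node, count in in_degree.most_common(n)
    ]


# Made-up sample: ids and titles are illustrative, not taken from the data set.
edges = [(1, 3), (2, 3), (4, 3), (1, 2)]
titles = {2: "East of Eden", 3: "The Grapes of Wrath"}

out_degree, in_degree = degree_counts(edges)
out_dist = distribution(out_degree)
in_dist = distribution(in_degree)
in_points = log_x_points(in_dist)
in_mean, in_sd, in_max = summary(in_degree)
top_two = top_inbound(in_degree, titles, n=2)
```

Keeping the per-product degrees separate from the distribution means the same in_degree counter serves tasks 2 through 5, so the full edge list only has to be scanned once.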

Solution Approach:

In this project, we aimed to analyze the network and platform analytics of Amazon products using various techniques and methodologies. Below is a detailed overview of our approach:

Dataset Used:

We utilized several datasets provided on Brightspace, including graph_subset_rank1000.txt, graph_subset_rank1000_cc.txt, and graph_complete.txt, along with id_to_titles.txt for mapping product IDs to titles.

Basic Data Information:

The datasets contained information about Amazon products, including co-purchase relationships, product IDs, and titles. We explored the structure of the network, the number of nodes, and edges in each dataset.

Data Processing Techniques:

We employed the igraph package in R for handling graph data structures and conducting network analysis.
Basic data processing techniques such as reading data from text files, creating directed and undirected graphs, and computing basic graph properties were applied.

Feature Selection:

Key features analyzed included out-degree and in-degree distributions, which provided insights into the connectivity and popularity of products within the network.

Method Used:

We utilized various algorithms available in the igraph package for tasks such as graph plotting, calculating basic graph properties, and analyzing degree distributions.

Evaluation Used:

Evaluation was primarily based on visual inspection of graphs, distribution plots, and summary statistics derived from the datasets.

Output:

[Output screenshots: Rplot 1, Rplot 2, Rplot 3, Rplot 4]

At CodersArts, we specialize in empowering businesses through advanced data analytics solutions. Our latest project focuses on unraveling the complexities of network and platform analytics within Amazon's expansive ecosystem. Leveraging our expertise in data processing and algorithmic exploration, we deliver comprehensive insights to meet your project requirements effectively.

From preprocessing datasets to conducting in-depth network analysis, our team guides you through every stage of the project with precision and expertise. By utilizing advanced tools such as the igraph package and implementing sophisticated algorithms, we unravel the intricate web of co-purchase relationships and network dynamics inherent in Amazon's product ecosystem.

Our commitment to excellence extends beyond analysis, as we prioritize delivering actionable insights that drive business growth. Through meticulous evaluation and visualization techniques, we provide a deep understanding of Amazon's network structure, enabling informed decision-making and strategic planning. Trust CodersArts to navigate the analytical landscape and unlock the full potential of your e-commerce operations through data-driven insights.

If you require any assistance with the project discussed in this blog, or if you find yourself in need of similar support for other projects, please don't hesitate to reach out to us. Our team can be contacted at any time via email at contact@codersarts.com.