
Wednesday, August 28, 2013

Analyzing Big Data with Twitter

http://blogs.ischool.berkeley.edu/i290-abdt-s12/

Marti Hearst, the course instructor at UC Berkeley, introduces the main concepts for the course, and Gilad Mishne (@gilad) of Twitter describes his goals for the course and provides an introduction to Twitter.  (slides for lecture1a)  (slides for lecture1b).

Twitter Philosophy and Software Architecture

Othman Laraki (@othman), Twitter’s Vice President for Growth, International and Revenue, on Growing a Human-Scale Service, and Raffi Krikorian (@raffi), the Director of Twitter’s Platform Services group, on the Twitter Software Ecosystem.  View the slides for Othman’s and Raffi’s talks.

Introduction to Hadoop

Bill Graham (@billgraham), who is active in the Hadoop community and a Pig contributor, gave a very clear and detailed intro to Hadoop and outlined how it is used at Twitter. His slides can be found here.

Introduction to Apache Pig

Jon Coveney (@jco) gives an in-depth tutorial on Apache Pig, including how it interacts with Hadoop.  The log analysis group at Twitter uses Pig extensively.  Jon’s slides can be found here: (pdf)

Coding to the Twitter API

Rion Snow (@rion) gave an introduction to the Twitter API, including the RESTful API and the streaming API for both Java and Python.  See all the slides (no video).

Slide on Sampling the Streaming API: Twitter4J

Detecting Twitter Trends

If you’d like to know how Twitter computes its Trending Topics,  Kostas Tsioutsiouliklis (@kostas) shared some of the secrets with the class.  He also talked about MinHash algorithms. See his lecture notes.
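For readers who have not met MinHash before, here is a minimal Python sketch of the general technique: it estimates the Jaccard similarity of two token sets by comparing short signatures instead of the full sets. This is not the method from the lecture (see the lecture notes for that); the function names, the salted BLAKE2 hashing scheme, and the default of 256 hash functions are all illustrative assumptions.

```python
import hashlib


def _salted_hash(token, seed):
    """Deterministic 64-bit hash of a token; the seed selects the hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_hashes=256):
    """Keep the minimum hash value per hash function (tokens must be non-empty)."""
    unique = set(tokens)
    return [min(_salted_hash(t, seed) for t in unique) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two sets agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Two made-up tweets, tokenized on whitespace.
    tweet_a = "obama wins second term as president".split()
    tweet_b = "obama wins a second term tonight".split()
    sig_a = minhash_signature(tweet_a)
    sig_b = minhash_signature(tweet_b)
    print("estimated:", estimate_jaccard(sig_a, sig_b))
    print("exact:    ", exact_jaccard(tweet_a, tweet_b))
```

The estimate gets closer to the exact value as the number of hash functions grows, while the signature stays a fixed size no matter how large the sets are.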

Real-Time Twitter Search

Brian Larson (@larsonite), the tech lead for search and relevance at Twitter, gives a detailed technical talk about how real-time search works at Twitter.

Splunk’s Software Architecture and GUI for Analyzing Twitter Data

Stephen Sorkin of Splunk described an alternative software architecture for processing large data. Splunk also has a sophisticated GUI for analyzing Twitter and other data sources in real time; be sure to watch the last 15 minutes of the video to see the demo.  Stephen’s slides: pdf

Twitter’s Social Network

Learn about weak ties, triadic closures, and personal pagerank, and how they all relate to the Twitter social graph from Aneesh Sharma (@aneeshs) in this lecture.  Slides here.
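As a rough illustration of the personal (personalized) PageRank idea, here is a minimal Python sketch of the textbook random-walk-with-restart formulation: a walker follows outgoing edges, and with probability alpha jumps back to one seed user, so scores measure closeness to that user. It is not the approach from the lecture; the function name, the restart probability of 0.15, and the toy follow graph are assumptions made for the example.

```python
def personalized_pagerank(graph, seed, alpha=0.15, iterations=50):
    """Score every node by how often a restarting random walk from `seed` visits it.

    graph: dict mapping each user to the list of users they follow.
    alpha: probability of jumping back to the seed at each step (assumed value).
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    scores = {node: 0.0 for node in nodes}
    scores[seed] = 1.0

    for _ in range(iterations):
        nxt = {node: 0.0 for node in nodes}
        for node, score in scores.items():
            out = graph.get(node, [])
            if out:
                # Spread the non-restart share evenly over followed users.
                share = (1 - alpha) * score / len(out)
                for target in out:
                    nxt[target] += share
            else:
                # Users who follow nobody send their mass back to the seed.
                nxt[seed] += (1 - alpha) * score
        # Restart step: an alpha share of all mass returns to the seed.
        nxt[seed] += alpha * sum(scores.values())
        scores = nxt
    return scores


if __name__ == "__main__":
    # Toy follow graph: "a" follows "b" and "c", and so on.
    follows = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
    for user, score in sorted(personalized_pagerank(follows, "a").items()):
        print(user, round(score, 3))
```

In this toy graph "d" scores zero because no walk starting at "a" can ever reach it, which is the sense in which the ranking is personal to the seed user.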

Big Learning with Graphs

Joey Gonzalez, a recent PhD from CMU and a postdoc at UC Berkeley, is working on GraphLab, the hot technology for processing huge graphs quickly.  There is a new version called GraphChi (for chihuahua) that you can run on your personal computer, so you don’t even need access to EC2 to run it going forward.  Slides here.

Twitter Recommendations

Alpa Jain (@alpa), who works on monetization algorithms at Twitter, described SVD and other recommendation algorithms used at Twitter.  Alpa’s slides are here: pdf

Security at Twitter and Elsewhere

Kurt Thomas is a former Twitter engineer and a current PhD student at UC Berkeley who studies how the criminal underground conspires to make money via unintended uses of computer systems.   He talked about fraud detection for Twitter and other online systems.  See his lecture notes.

Information Diffusion on Twitter

Stan Nikolov (@snikolov) of the Twitter Search and Relevance team walked through one particular theoretical model of information diffusion which tries to predict under what conditions an idea stops spreading based on a network’s structure.  The slides in his Lecture Notes let you see the Pig scripts in detail, and you can see the video simulations that Stan created on his blog.

Introduction to Scalding

On Thursday we learned about an alternative language for analyzing big data: Scalding. It’s built on Scala and is used extensively by the Twitter Revenue group. Oscar Boykin (@posco) presented a lecture that he and Argyris Zymnis (@argyris) put together. See the lecture notes for more details.

Spark: Making Big Data Analytics Interactive and Real-Time

Spark is the hot next thing for Hadoop / MapReduce, and  Matei Zaharia (@matei_zaharia), a PhD student in UC Berkeley’s AMP Lab, described how it works and what’s coming next.  The key idea is to make analysis of big data interactive and able to respond in real time.  Next up in the research agenda is streaming data and blending real time and batch processing.  Matei also gave a live demo. (slides here)

Course Wrapup

These lecture notes simply summarized the course at a high level.  Thanks for the great semester!

Project Talks and Demos: Tues Dec 11 4pm-6pm



Analyzing the Twitter Conversation and Interest Graphs

For assignment 3, students analyzed and compared a portion of the Twitter “conversation graph” and the “interest graph”.  Conversations were found by looking for Twitter “@mentions”, and the interest graph by looking at the friend/follow graphs for a user (finding friends of friends, taking a k-core analysis, and closing the triangles). The attached document highlights much of the students’ work.
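For readers unfamiliar with the k-core step mentioned above, here is a minimal Python sketch: the k-core of a graph is what remains after repeatedly removing every node with fewer than k neighbors. The sketch assumes an undirected graph stored as a symmetric adjacency dict (so a follow graph would first need to be symmetrized); the function name and the toy graph are illustrative, not taken from the assignment.

```python
def k_core(neighbors, k):
    """Return the set of nodes in the k-core of an undirected graph.

    neighbors: dict mapping each node to an iterable of its neighbors.
    The adjacency is assumed to be symmetric (if b is in a's list, a is in b's).
    """
    remaining = {node: set(nbrs) for node, nbrs in neighbors.items()}
    changed = True
    while changed:
        changed = False
        for node in list(remaining):
            if len(remaining[node]) < k:
                # Drop the node and remove it from its neighbors' lists.
                for nbr in remaining.pop(node):
                    if nbr in remaining:
                        remaining[nbr].discard(node)
                changed = True
    return set(remaining)


if __name__ == "__main__":
    # A triangle (a, b, c) with one extra user "d" hanging off "a".
    graph = {
        "a": ["b", "c", "d"],
        "b": ["a", "c"],
        "c": ["a", "b"],
        "d": ["a"],
    }
    print(k_core(graph, 2))  # the triangle survives, "d" is pruned
```

Pruning low-degree nodes this way is a common way to shrink a large follow graph down to its densely connected center before visualizing it.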
One of the most impressive graphs was made by Achal Soni.  He used Java and the Twitter4J library to obtain 3000 tweets for 4 rappers (Drake, Kendrick Lamar, J Cole, and Big Sean). He extracted @mentions from these tweets and created a graph whose edges connect the celebrities to the users they were conversing with.

To compare the interest network with the conversation network for these celebrities he looked at the friends of each of these celebrities, and classified all of these relationships into 3 categories per celebrity: users who they followed but didn’t mention, users who they mentioned but didn’t follow, and users in the intersection. This was repeated for each of the celebrities.
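The mention extraction and three-way split described above can be sketched in a few lines of Python. Achal’s actual code was written in Java with Twitter4J, so this is only an illustration of the idea: the function names, the regular expression for @mentions, and the sample data are all assumptions.

```python
import re

# Assumed pattern: "@" not preceded by a word character (to skip e-mail
# addresses), followed by up to 15 letters, digits, or underscores.
MENTION_RE = re.compile(r"(?<!\w)@(\w{1,15})")


def extract_mentions(tweets):
    """Collect the lowercased screen names mentioned in a list of tweet texts."""
    mentioned = set()
    for text in tweets:
        mentioned.update(name.lower() for name in MENTION_RE.findall(text))
    return mentioned


def classify_relationships(followed, mentioned):
    """Split a celebrity's contacts into the three categories from the post."""
    followed, mentioned = set(followed), set(mentioned)
    return {
        "followed_only": followed - mentioned,
        "mentioned_only": mentioned - followed,
        "both": followed & mentioned,
    }


if __name__ == "__main__":
    # Made-up data for one celebrity.
    tweets = ["Studio tonight with @alice and @Bob", "thanks @carol!"]
    followed = {"alice", "dave"}
    print(classify_relationships(followed, extract_mentions(tweets)))
```

Running the same split once per celebrity gives the per-celebrity node categories that are later colored in the combined graph.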
To see what it would look like to compose the relationships for all four celebrities into a single network, he input the combined graph into Gephi.  The results were surprising, and phenomenal!
The layout algorithm beautifully clusters each celebrity’s network together and keeps the shared components in the middle.  Each network is composed of nodes of three different colors. Each color represents the category the node belongs to (whether it is a friend of the respective celebrity, somebody he has conversed with, or both).  For example, Big Sean’s network is primarily blue, as opposed to dark orange or purple, which means he talks to a lot of people whom he does not follow.  It is also interesting to see which users are in the intersection of the different hip hop artists; a quick examination of a few of these indicates that they are mainly other big figures in the hip hop scene (artists, DJs, etc.).

More Election Day Twitter Analysis and Visualization

Finishing up with last week’s in-class assignment to respond to election-day Twitter activity, here is some other great work from the class:
Priya Iyer processed a collection of tweets from just before and after the election returns and pulled out the most frequent words.  She then ran them through the wonderful Wordle program to create a visualization where size indicates frequency (color was chosen by the program, which coincidentally (?) chose blue and red for “obama” and “romney”).
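The word-counting step behind a visualization like this could look roughly like the following Python sketch. It is not Priya’s code: the function name, the tiny stopword list, and the sample tweets are assumptions for illustration only.

```python
import re
from collections import Counter

# A deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "for", "rt"}


def top_words(tweets, n=10, stopwords=STOPWORDS):
    """Return the n most frequent (word, count) pairs across all tweets."""
    counts = Counter()
    for text in tweets:
        # Drop links first so URL fragments are not counted as words.
        text = re.sub(r"https?://\S+", "", text.lower())
        words = re.findall(r"[a-z']+", text)
        counts.update(w for w in words if len(w) > 1 and w not in stopwords)
    return counts.most_common(n)


if __name__ == "__main__":
    sample = [
        "Obama wins! http://t.co/abc",
        "obama and romney",
        "Romney concedes to Obama",
    ]
    for word, count in top_words(sample, n=5):
        print(word, count)
```

The resulting word and count pairs are the kind of input a word-cloud tool needs to scale each word by its frequency.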
Before the election:

Just after the returns:

Below is a heatmap of the locations of people who tweeted with the hashtag #Ivoted on election day, based on their geolocation information, by the BandHype team.

Derrick Cheng and Katrina Rogan made a video of their app running that shows the proportion of tweets that talk solely about Obama and solely about Romney.  Here is a screenshot:

Check it out!
I think this was a great exercise, as we all learned from each other a bit more about how to get our apps online and get faster at doing our coding, which should help with the final projects.


Video Lecture: Spark: Making Big Data Analytics Interactive and Real-Time by Matei Zaharia

Spark is the hot next thing for Hadoop / MapReduce, and yesterday Matei Zaharia, a PhD student in UC Berkeley’s AMP Lab, gave us a terrific lecture about how it works and what’s coming next.  The key idea is to make analysis of big data interactive and able to respond in real time.   Matei also gave a live demo.  Watch here:
Next up in the research agenda is streaming data and blending real time and batch processing. This research team is doing truly amazing work and we’re very grateful to @matei_zaharia for the clear and impressive lecture! (slides here)

SF Map of Obama Victory Tweets

Here is another great result of the in-class assignment to do something with tweets on election day (good thing I gave students 24 hours to turn it in!): Arian Shams and Guarav Shetti managed to run their tweet capture code for 30 minutes right after Obama was declared the President Elect a second time.  Great job, students!

Election Tweets From Restaurants

Students in the Twitter class were challenged yesterday to write some code in class to do something interesting with election day tweets.  One group leveraged the code they are developing for their final project to quickly build an app that shows what people who are near (or claim to be near) Berkeley restaurants are saying about the election.  Now it has shifted to people contemplating or celebrating the results.
Give it a try!


Video Lecture: Intro to Scalding by @posco and @argyris

On Thursday we learned about an alternative language for analyzing big data: Scalding. It’s built on Scala and is used extensively by the Twitter Revenue group. Oscar Boykin presented a lecture that he and Argyris Zymnis put together for us:
Because Scalding is built on the functional programming language Scala, it has an advantage over Pig in that you can have the equivalent of user-defined functions directly in your code. See the lecture notes for more details.  Be sure to watch the video to get all the details, especially since Oscar managed to make us all laugh throughout his lecture. Thanks guys!

Video Lecture: Information Diffusion on Twitter by @snikolov

Today Stan Nikolov, who just finished his master’s at MIT studying information diffusion networks, walked us through one particular theoretical model of information diffusion which tries to predict under what conditions an idea stops spreading based on a network’s structure (from the popular Easley and Kleinberg networks book).  Stan also gathered a huge amount of Twitter data, processed it using Pig scripts, and graphed the results using Gephi.  The video lecture below shows you some great visualizations of the spreading behavior of the data!
The slides in his Lecture Notes let you see the Pig scripts in more detail.
You can see the videos that Stan created on his blog.
For those who want the details before watching the video, this is a threshold-based model for people choosing to do A or B based on what their neighbors did, modeled as a coordination game where if neighbors pick the same thing, they get a payoff.  Even though spread of topics on Twitter is not quite the same kind of coordination game, Stan tells us the threshold model is very popular independent of the game-theoretic justification.
Easley and Kleinberg ask what it is about the structure of the network that causes something to keep spreading or stop spreading. They prove that clusters, defined in terms of cluster density, stop the spread and that, in fact, they are the only thing that stops it.
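To make the threshold model concrete, here is a minimal Python sketch under stated assumptions: each person adopts the new behavior once at least a fraction q of their neighbors has adopted it, and a cluster’s density is the smallest fraction of neighbors that any member has inside the cluster. In the textbook result, a cascade with threshold q is blocked exactly when the non-adopting part of the network contains a cluster of density greater than 1 - q. This is not Stan’s Pig code; the function names and the toy network are made up for illustration.

```python
def run_cascade(neighbors, initial_adopters, threshold):
    """Spread adoption until no one else meets the threshold; return all adopters."""
    adopters = set(initial_adopters)
    changed = True
    while changed:
        changed = False
        for node, nbrs in neighbors.items():
            if node in adopters or not nbrs:
                continue
            fraction = sum(n in adopters for n in nbrs) / len(nbrs)
            if fraction >= threshold:
                adopters.add(node)
                changed = True
    return adopters


def cluster_density(neighbors, cluster):
    """Smallest fraction of a member's neighbors that lie inside the cluster."""
    cluster = set(cluster)
    return min(
        sum(n in cluster for n in neighbors[node]) / len(neighbors[node])
        for node in cluster
    )


if __name__ == "__main__":
    # A chain 1-2-3 attached to a tightly knit triangle 4-5-6.
    network = {
        1: [2],
        2: [1, 3],
        3: [2, 4],
        4: [3, 5, 6],
        5: [4, 6],
        6: [4, 5],
    }
    print(run_cascade(network, {1}, threshold=0.5))  # stops at the triangle
    print(cluster_density(network, {4, 5, 6}))       # density of the blocker
    print(run_cascade(network, {1}, threshold=0.3))  # lower threshold gets through
```

In the toy network the triangle has density 2/3, which is greater than 1 - 0.5, so a cascade with threshold 0.5 stalls there, while a threshold of 0.3 lets it spread to everyone.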
For his data, he got  some trending hashtags (the type that are memes or word games, like #ThingsYouSayToYourBestFriend, not the type that are events, like #debates), and recorded how they spread before the hashtag actually becomes a trending topic (so that the dominant mode of spreading is from person to person, not from some exogenous event, or from the hashtag being on the trending topics list).
Information diffusion in networks is a really difficult topic to work on empirically, so Stan, thank you so much for this terrific work!