On December 14th, some of the most innovative companies in the media and entertainment space presented their data science experiences and challenges at the Data Science Salon in Los Angeles. If you missed the event, here’s a quick summary we put together. Enjoy!
Jonathan Morra, VP of Data Science at ZEFR, kicked off the conference by showcasing how ZEFR clusters YouTube videos using supervised and unsupervised learning. For ZEFR, context matters when identifying, filtering, and organizing videos. Framed as supervised learning, this is a multi-label problem rather than a multi-class one: a single video can carry several labels at once. The data science team at ZEFR featurized the videos and compared ML (machine learning) frameworks to check how well the videos surfaced in a specific situation. For unsupervised learning they used Latent Dirichlet Allocation (LDA). They concluded that even though clustering YouTube videos is difficult, unsupervised clustering is very valuable. See the full presentation recording here.
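To make the two approaches concrete, here is a minimal sketch (ours, not ZEFR’s code) of LDA topic modeling and multi-label classification in scikit-learn; the toy video descriptions and labels are made up:

```python
# Illustrative only: toy video descriptions and labels, not ZEFR's data.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

docs = [
    "nba highlights dunk basketball",
    "makeup tutorial beauty routine",
    "basketball training drills workout",
]
X = CountVectorizer().fit_transform(docs)

# Unsupervised: LDA gives each video a mixture over latent topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_mix = lda.fit_transform(X)  # one topic distribution per video

# Supervised, multi-label (not multi-class): a video can carry several
# labels at once, so one binary classifier is fit per label.
y = np.array([[1, 0], [0, 1], [1, 1]])  # columns: ("sports", "lifestyle")
clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)
```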
The Data Science Salon was a unique event. Never have I seen such a diverse audience or set of speakers at a data science event. It was both fun and informative to be encouraged to think outside of my data science box.
~ Jonathan Morra, VP Data Science at ZEFR
Kevin Perko, Data Science Lead at SCRIBD, spoke about SCRIBD’s experience using deep learning. SCRIBD’s main focus was spelling correction to improve the user experience. After a comprehensive framework evaluation, SCRIBD selected Keras because it is Python-based and its abstraction layer made it easy to get started. For the architecture, they settled on neural networks; for the algorithm, they selected sequence-to-sequence mapping and used OpenNMT (an open-source neural machine translation toolkit in Torch) because of its existing libraries/frameworks, open-source community, and track record of successful implementations. For production, they developed on AWS and TensorFlow. They were able to build a simple, fast algorithm that was about 65% accurate; layering dictionary matching on top raised accuracy to roughly 90% (seq2seq + dictionary = 90%). In 2018, Kevin expects to tackle projects in query parsing, content summarization, and document classification. See the full presentation recording here.
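For readers unfamiliar with the approach, here is a minimal character-level encoder-decoder (seq2seq) sketch in Keras. It is our illustration under assumed vocabulary and dimension sizes, not SCRIBD’s actual model:

```python
# A minimal character-level seq2seq sketch (illustrative, not SCRIBD's
# model): an encoder-decoder that maps a misspelled string to its
# correction. Vocabulary size and hidden dimension are assumptions.
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

num_chars = 64      # assumed character vocabulary size
latent_dim = 256    # assumed hidden size

# Encoder reads the misspelled sequence and keeps only its final state.
encoder_inputs = Input(shape=(None, num_chars))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)

# Decoder generates the corrected sequence, conditioned on that state.
decoder_inputs = Input(shape=(None, num_chars))
decoder_seq = LSTM(latent_dim, return_sequences=True)(
    decoder_inputs, initial_state=[state_h, state_c])
decoder_outputs = Dense(num_chars, activation="softmax")(decoder_seq)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```

In the spirit of the talk’s dictionary trick, a model’s suggested correction can then be accepted only when it appears in a dictionary of known words, falling back to the original input otherwise.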
Xavier Kochhar (founder of The Video Genome Project / Hulu) sat down with Leonard Armato (CEO & Tour Commissioner at AVP Pro Beach Volleyball Tour, Inc.) and discussed:
· Why videos are categorized by genre, title, and description (something that used to be unnecessary)
· Metadata and the person: Understanding data and what to do with it
· Systems having to take into account who we are and “when” we are
· Why making data better and more structured is more important than the algorithm itself
· The importance of liberal education in continued learning
See the full interview recording here.
Becky Tucker, Senior Data Scientist at Netflix, shared some insights about the culture at Netflix. She said that data science is 80% data, 20% science, and highlighted that “data is the new oil.” The culture at Netflix is based on freedom and responsibility, context over control, and values of judgment, communication, curiosity, innovation, courage, passion, selflessness, inclusion, integrity, and impact. Using an analogy between hamburger stands and butlers, she described how demands flow to the data science team: hamburger stands are service-oriented, rarely work solo, and fill orders on demand; butlers are reliable, efficient, and selfless, stay ahead of guest requests, anticipate needs and problems, understand the customer, and offer quality. She closed her presentation by describing the data science ecosystem at Netflix: hundreds of models are deployed for business needs across the company (on a self-serve platform), data engineers work in close partnership with development teams, and the emphasis is on maintainability (graceful error handling). See the full presentation recording here.
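As a loose illustration of “maintainability (graceful error handling),” here is a hedged sketch, not Netflix’s code, of a scoring wrapper that degrades to a safe default instead of failing the request; the fallback value and names are our assumptions:

```python
# A hedged sketch of "graceful error handling" around a deployed model:
# fall back to a safe default instead of failing the whole request when
# the model or its input features misbehave.
import logging

logger = logging.getLogger("model-service")

FALLBACK_SCORE = 0.0  # assumed safe default, e.g. a population average

def score(model, features):
    try:
        return model.predict(features)
    except Exception:
        # Log for the owning team, but keep serving traffic.
        logger.exception("model scoring failed; returning fallback")
        return FALLBACK_SCORE
```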
Next up was a panel of data science professionals in a fireside chat on Data Science Applications in Media and Entertainment. The moderator was Joe Devon (Diamond), and the panelists were Alejandro Cantarero (VP at LA Times Media Group), Keisuke Inoue (VP Data Science at Emoji), Gwen Miller (VP Audience & Platform at Kin Community), and Hollie Choi (Exec. Director, IT Intellectual Property at 20th Century Fox). Some of the most notable concepts discussed were:
· The LA Times uses data science to drive audience engagement and a recommendation system. They classify people who are talking about the same thing and work with the emotionality of headlines (one way to score that is sketched after this list).
· Emoji works with mobile stickers. Depending on the message content, they suggest “alternative content” using simple NLP (they want to understand what people are talking about by capturing text messages). Their metrics are based on views and shares, as well as session frequency and updates. They use what they call content-exchange technology: when someone shares a sticker about a movie or other content, they look at how people are reacting to that movie. They also analyze emoji usage to gauge people’s perceptions and reactions to content.
· Kin Community studies the times when the audience is engaged or disengaged, the psychological impact of closing words (they have a strong effect), and identity content (when people share things, they are saying something about themselves). They capture as much data as they can while content is streaming. To Kin Community, data is people talking directly to them (not just comments).
· 20th Century Fox produces content with a unique ID that makes it identifiable. The information is sourced to a single place, and they maintain a survival record for each piece of content.
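As one way to picture “the emotionality of headlines,” here is a hedged sketch using NLTK’s VADER sentiment analyzer; the panel did not say what the LA Times actually uses, so this is purely illustrative:

```python
# Illustrative only: VADER stands in for whatever headline-emotionality
# scoring the LA Times actually uses, which the panel did not specify.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
sia = SentimentIntensityAnalyzer()

headlines = [
    "City council approves new budget",
    "Devastating wildfire forces thousands to flee",
]
for headline in headlines:
    # compound runs from -1 (very negative) to +1 (very positive);
    # its magnitude is a rough proxy for emotional charge.
    print(headline, sia.polarity_scores(headline)["compound"])
```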
See the full presentation recording here.
Honestly, this was the first “conference” where I actually sat through all the sessions! And I wasn’t even texting. All the talks were uniquely interesting and useful, ranging from practical solutions to common problems to new ideas and visions from some of the top industry leaders.
~Keisuke Inoue, VP Data Science at Emoji
Mostafa Majidpour, Sr. Data Scientist at Time, Inc., spoke about the journey of deploying a data science engine to production. He started the talk with a scenario of a user browsing a website: based on the user’s cookies, they have current and previous browsing history and behavior. To achieve predictive modeling with real-time or near-real-time scoring, there were a few possible approaches: a lookup table (to store precomputed scores), rewriting the code for deployment, or a directly deployable data science artifact. They selected MLeap because it is based on Python and Scala, integrates well with Java, supports many transformations and ML models, and is fast. After a comprehensive comparison they built a model that recommends products to online users, and the proposed system (SparkML plus MLeap) boosted conversion rate by 7% in phase I and 12% in phase II after adding more features. See the full presentation recording here.
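As a rough sketch of that deployment path, here is what exporting a fitted SparkML pipeline as an MLeap bundle can look like in PySpark; the features, paths, and data are our assumptions, not Time Inc.’s pipeline:

```python
# A hedged sketch: train a toy SparkML pipeline and serialize it as an
# MLeap bundle, so a lightweight MLeap runtime can score requests in
# real time outside of Spark. All feature names and paths are made up.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
import mleap.pyspark  # noqa: F401 -- registers serializeToBundle
from mleap.pyspark.spark_support import SimpleSparkSerializer  # noqa: F401

spark = SparkSession.builder.appName("mleap-demo").getOrCreate()

# Toy browsing-behavior features standing in for cookie-derived history.
df = spark.createDataFrame(
    [(3.0, 12.0, 1.0), (1.0, 2.0, 0.0), (5.0, 30.0, 1.0)],
    ["visits", "pages_viewed", "converted"],
)

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["visits", "pages_viewed"],
                    outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="converted"),
])
model = pipeline.fit(df)

# Export the fitted pipeline as an MLeap bundle for online scoring.
model.serializeToBundle("jar:file:/tmp/recommender.zip",
                        model.transform(df))
```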
Zahra Ferdowsi, Senior Data Scientist at Snapchat, presented the challenges of improving performance on Android devices. She explained that they focused on Android because there are roughly 15,000 distinct Android devices versus 60+ Apple devices, and Snapchat’s goal is to support all features on all phones. The initial approach was to cluster devices by marketing name, but a single marketing name can cover many models: the Galaxy S8, for example, has 10 models, each with country-of-release variations that share few similarities. They therefore decided to cluster based on performance metrics, and they recommend always using percentile metrics rather than averages. Even so, it was difficult to tell whether a data point was an outlier, a genuinely slow device, or just bad data. What worked best for them was to cluster on performance plus hardware info such as CPU, RAM, clock speed, and screen size. See the full presentation recording here.
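A hedged sketch of that final approach, clustering on hardware specs plus a percentile performance metric, might look like this; all device numbers are made up, and this is not Snapchat’s pipeline:

```python
# Illustrative only: cluster synthetic "devices" on hardware specs plus
# a 90th-percentile frame time, per the talk's advice to prefer
# percentiles over averages (which outliers can distort).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Per-device frame times (ms): summarize each device by its p90.
frame_times = rng.lognormal(mean=2.8, sigma=0.4, size=(100, 500))
p90 = np.percentile(frame_times, 90, axis=1)

# Hardware features: cpu_cores, ram_gb, clock_ghz, screen_inches.
hardware = np.column_stack([
    rng.integers(2, 9, 100),
    rng.integers(1, 9, 100),
    rng.uniform(1.2, 3.0, 100),
    rng.uniform(4.5, 6.8, 100),
])

features = StandardScaler().fit_transform(np.column_stack([hardware, p90]))
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(features)
```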
Daniel Monistere, SVP of Client Solutions at Nielsen, was the next presenter. Daniel titled his presentation “Making Big Data Smart.” Nielsen collects information about every TV and the demographics of the national TV audience; their devices electronically measure both what the set is tuned to and who is watching. They currently use a Global Television Audience Meter (GTAM) and a Scrolling Text People Meter (STPL). The big data Nielsen manages is not a census of the market but is based on where the population is, and the scope of available data depends on the provider. Nielsen Common Homes determines the accuracy of big data sources, corrects inaccuracies (in house or via the provider), performs a panel match by provider, gathers third-party data, and fills in missing household characteristics. Nielsen is committed to quality end to end. They believe that big data + high quality = gold standard, and we agree! See the full presentation recording here.
There was a lot of experience in the room and we were able to learn from one another.
~Daniel Monistere, SVP — Client Solutions at Nielsen
Ann E. Greenberg, founder and CEO at Entertainment AI, talked about “The Age of Co-Creation.” Ann shared that her commitment is to create entertainment that teaches us something other than consumption: a democratic cinema that encourages us to move from viewing to doing, to grow stories rather than just tell them. Why is this important now? Because we now have access to abundant data, smart algorithms, and powerful computational tools, and humans require new forms of storytelling that reflect and inform new ways of thinking. We learned about the concept of the Smart Script (Smart Content), which virtualizes production across time and space and is collaborative and democratic. Entertainment AI is generating new levels of inclusiveness, micro-engagements, microtransactions, and meaningful experiences by creating a decentralized Smart Content Marketplace where “connection is the key.” See the full presentation recording here.
Michael Housman, Co-Founder and Chief Data Science Officer at RapportBoostAI, talked about “Building Smart AI: How Deep Learning Can Get You into Deep Trouble.” In the transition from statistics and econometrics to machine learning, and from machine learning to deep learning, we lost causal inference (the why): where we once had transparent models, we now have black boxes. Michael highlighted that AI must be designed so that it does not reinforce existing biases (the how) or ignore causal relationships (the why). We also learned about some of the problems with bots: they are unable to sell effectively, they don’t reflect brand identity, and they don’t adapt to user interactions. To address these issues, Michael recommends reducing the use of formal language, since it feels robotic and cold, and he reminded us that conversation matters (for triggering positive emotion, responsiveness, and friendliness). In conclusion, deep learning has unleashed the potential to solve countless problems, but we can’t lose sight of the how and the why. See the full presentation recording here.
Data Science SALON LA was an invaluable experience that exposed me to many ideas I’ll be bringing back to my work at RapportBoost.
~Michael Housman, Co-Founder and Chief Data Science Officer
Ali Baghshomali, a Data Scientist at BuzzFeed, talked about content metrics at BuzzFeed. As the content process becomes more and more algorithmic, the parts that we as humans control become more and more important. His takeaway: “When a measure becomes a target, it ceases to be a good measure.” If we focus only on getting people to click, the content itself stops mattering. BuzzFeed recommends that we:
· Switch the core metrics periodically to adapt to a diverse portfolio of content
· Complement the current metrics to address assessment weaknesses.
· Use compound metrics, e.g. watch time = views × video length × average view % (a quick illustration follows).
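Here is that compound metric spelled out, with made-up numbers:

```python
def watch_time(views: int, video_length_s: float, avg_view_pct: float) -> float:
    """Watch time = views x video length x average % of the video watched."""
    return views * video_length_s * avg_view_pct

# e.g. 10,000 views of a 120-second video watched 65% through on average:
total_seconds = watch_time(10_000, 120, 0.65)  # 780,000 seconds watched
```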
Watch this video: https://www.youtube.com/watch?v=Z5vxRC8dMvs