Before the advent of social media, voicing a documented opinion was the preserve of a few people, generally called the "opinion makers". But in today's day and age, anyone anywhere with an internet connection can register her likes, dislikes and endorsements at minimal cost and effort. Collected on a large enough scale, this information can feed models that gauge what this new crop of opinion makers is thinking. And these people matter because they are the consumers - of everything from Hotstar subscriptions to airline tickets to financial products.

The product we have conceived is a web tool that lets users track the sentiment of viewers for a movie trailer on YouTube from the date of its launch to the date of release of the movie. It will help individuals as well as production houses monitor what people are thinking about upcoming movies. Though it sounds like a simple task, the complication with YouTube is that the comments are not necessarily structured using proper grammar and vocabulary, which makes it tricky to deduce the sentiment from a comment. Moreover, in India most of the comments are in Hinglish, which, coupled with the loose grammar and varied vocabulary, makes it even harder to wring the sentiment out of the comments.

The graph above depicts the behaviour of the top five positive and top five negative words, plotting their weights against the L2 penalty applied to them.
The L2 penalty is a proxy for estimating the effect a word has on the sentiment of a sentence. As you can see, when the L2 penalty is close to zero the weights of the respective negative and positive words are spread over a large range. But this does not scale to the complicated decision boundaries needed for large datasets, a problem intrinsic to sentiment analysis on YouTube comments. The decision boundaries establish the difference between the positive and negative character of the words, and this character in turn lends the sentiment to a sentence. As we move towards the centre of the graph, the weights of the words have a smaller spread. This is good for scaling a model up to a large and complicated dataset such as the one we are dealing with, and for building a deeper neural net. With less spread in the weights it also becomes easier to extend our vocabulary and label a large dataset for supervised sentiment analysis. On the right-hand side of the graph, the weights converge to zero under a heavy L2 penalty, as expected. Although this is the most popular form of regularisation used by AI practitioners, we are working on other novel methods, such as dynamically rotating penalty parameters using Lagrange multipliers. A small sketch of the experiment behind the graph follows.
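To make this concrete, here is a minimal sketch of the kind of experiment the graph comes from: a bag-of-words logistic regression trained at several L2 penalty strengths, after which we inspect how the weights of the strongest positive and negative words spread out or shrink. The toy comments, labels and penalty values are placeholders, not our actual data or model.

```python
# Minimal sketch: watch word weights shrink as the L2 penalty grows.
# The tiny corpus and labels below are placeholder assumptions.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

comments = ["awesome trailer loved it", "total waste of time",
            "brilliant acting and music", "boring and predictable story"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

vectorizer = CountVectorizer().fit(comments)
features = vectorizer.transform(comments)
vocab = vectorizer.get_feature_names_out()

# C is the inverse of the L2 penalty: small C means strong regularisation.
for C in (100.0, 1.0, 0.01):
    clf = LogisticRegression(penalty="l2", C=C).fit(features, labels)
    w = clf.coef_[0]
    top_pos = [vocab[i] for i in np.argsort(w)[-5:]]
    top_neg = [vocab[i] for i in np.argsort(w)[:5]]
    print(f"C={C}: weight range {w.min():.3f} to {w.max():.3f}")
    print("  top positive:", top_pos, "| top negative:", top_neg)
```

With a weak penalty the weights occupy a wide range; as the penalty strengthens they bunch together and eventually converge towards zero, which is the behaviour the graph above traces for real comment data.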
Recall the last time you used Google Search, Facebook, Twitter, YouTube or Instagram. You probably landed on this page from one of those websites. Now try to recall the last time you paid for any of those services, or for that matter any of the thousand such services that are practically free. The answer would be never. This dilemma spawned the business of online ads that crowd up our browser windows. With the deluge of content reducing our attention spans to seconds, this revenue model is also facing immense pressure. The reason: online ads earn money when someone sees them or clicks on them, and with attention scarce, both are getting tougher to get. Ads are a major portion of this commission/royalty business, with personalised mails and suggestions on e-commerce websites making up for the latter. Therefore, despite the primary goal being the sale of a product, there is a lot of money to be made before that happens, in case you know exactly what is worthy of a click. This has become possible with the pervasive use of cookies, which track what exactly you are up to when on the internet, and machine learning algorithms that can analyse huge datasets and come up with optimum predictions.

This will be implemented using a combination of methods that can segregate pictures, analyse text and classify products. In the case of images this will be accomplished using well-known clustering algorithms: we successfully manage to separate pictures without having to tell the computer what they are. Consider the pictures below. The segregation happens as given in the panel below. Observe how the similarly coloured dots cluster together. This process was executed for only 20 iterations due to a lack of computing power; with more iterations the separations become clearer. A sketch of this approach follows below.
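As a rough illustration of the picture separation step, here is a minimal sketch, assuming each image is summarised by a simple colour feature and grouped by k-means with no labels supplied. The file names, the feature choice and the cluster count are illustrative assumptions; the 20-iteration cap mirrors the run described above.

```python
# Minimal sketch: group images by colour with k-means, no labels given.
# File names and the mean-RGB feature are illustrative assumptions.
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

paths = ["img1.jpg", "img2.jpg", "img3.jpg", "img4.jpg"]  # placeholder files

# Represent each picture by its mean RGB colour.
features = np.array([
    np.asarray(Image.open(p).convert("RGB")).reshape(-1, 3).mean(axis=0)
    for p in paths
])

# max_iter=20 caps the run, as in the post; more iterations separate better.
km = KMeans(n_clusters=2, max_iter=20, n_init=1, random_state=0).fit(features)
for path, cluster in zip(paths, km.labels_):
    print(path, "-> cluster", cluster)
```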
In order to analyse and classify text we deploy a method called text fingerprinting. This analyses and categorises similar words together on the basis of how they have been used, and not just their obvious meaning. Our model has learned to fingerprint text from a large number of documents uploaded on Wikipedia; similar words end up close to each other in terms of distance. The table below gives a snapshot of the optimised results. As can be seen, former US president Barack Obama is predicted closest to his vice-president Joe Biden. You can also see Joe the Plumber, whose conversation with Obama turned out to become a very popular story. The clustering of words can also be seen in the plots that follow, which relate the query radius to the number of documents, and the number of queries to the distance between the words. A sketch of the idea, using publicly available word vectors, follows below.
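Here is a minimal sketch of the fingerprinting idea. It stands in for our own model with publicly available GloVe word vectors trained on Wikipedia, loaded through gensim; that substitution, and the query word, are our assumptions rather than our actual pipeline.

```python
# Minimal sketch: words used in similar contexts sit close together in
# a vector space learned from Wikipedia-scale text, so nearest neighbours
# recover related words. GloVe via gensim stands in for our own model.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")  # Wikipedia-trained word vectors

# Nearest words by cosine similarity; for "obama" these typically include
# "biden", matching the kind of result quoted above.
for word, score in wv.most_similar("obama", topn=5):
    print(f"{word}: {score:.3f}")
```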
The last level of classification, that of the products, will happen on the basis of the metatags that describe these products and their sales history. The final model will combine all three algorithms into an integrated solution.

Over the past few years, artificial intelligence has started making inroads into the financial sector, be it lending or trading. According to CB Insights, financial technology firms closed 496 funding deals worth $8 billion in 2016, a record high. And needless to say, many of these firms use some form of data analytics in the solutions they provide - be it general-purpose companies such as Opera and Kensho, credit research firms such as Affirm and Avant, or insurance sector start-ups such as Lemonade and Cyence. In the present case, we will focus on the lending segment. Through artificial intelligence it is possible to make the process of judging whether a borrower is going to honour a loan significantly more insightful and easy. This will be done by letting the machine, in our case a computer or a web server, read all the data that is available on past borrowers. Our algorithm will then learn how to segregate good and bad borrowers on the basis of their repaying habits. It will also glean out the most important characteristics that affect whether or not the borrower pays back on time. This algorithm takes these fields as input:
Our algorithm has been trained on the Lending Club dataset, now widely used by people around the world to benchmark their solutions. In our case the performance is given by the graph below. Though it may seem like gibberish to many, it essentially tells how good we are at predicting whether a loan will go bad or not - a bad loan being what India calls a non-performing asset. The key characteristic this graph depicts is that the test and training errors stabilise after a small number of iterations. A sketch of such a model appears below. Using this algorithm, banks and non-banking finance companies can monitor the risk emanating from present borrowers and gauge the repaying ability of thin-file and no-file customers who have come to the bank for their first loans.
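For a flavour of what such a model looks like, here is a minimal sketch: a gradient-boosting classifier trained on historical loans, followed by a readout of the most influential borrower fields. The file name, column names and model choice are illustrative assumptions rather than our exact setup.

```python
# Minimal sketch: classify good vs bad loans and rank borrower fields.
# The CSV name and the "defaulted" column are placeholder assumptions,
# and the features are assumed to be numeric / already encoded.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

loans = pd.read_csv("lending_club.csv")   # placeholder file name
y = loans["defaulted"]                    # 1 = loan went bad
X = loans.drop(columns=["defaulted"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))

# The most influential borrower characteristics, analogous to the ones
# the post says the algorithm gleans out.
ranked = sorted(zip(X.columns, clf.feature_importances_),
                key=lambda t: t[1], reverse=True)
for name, importance in ranked[:10]:
    print(f"{name}: {importance:.3f}")
```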
Although this sounds tailored for the lending segment, the algorithm in its foundation is a classification process that is able to tell the good from the bad, the better from the best, and so on. Therefore, it can be deployed in other sectors also, such as insurance and telecom. In the case of the former, by analysing consumer data and policy features for an insurance aggregator, we can help them recommend the best policies personalised according to the needs of the consumer. This will help reduce the cost of lead generation and significantly improve the probability of customer acquisition. In the telecom sector, classification can be used to detect fraud on the network. Our algorithm can learn from call detail records where and why there are network glitches. This can help telecom companies detect, localise and isolate problems on their networks.
Author: Nullpointer consists of Akshay Bharti and B Sundaresan.