It is obvious at first sight that summer has come to an end. On top of end demos and PI planning ceremonies, EARL 2021 took place right in the middle of this crazy period! I am really grateful that I managed to reconcile them all and take part in this wonderful event for the second year in a row. Last year’s edition was amazing. I wrote a report from EARL 2020, which you can see here. I wanted to take exactly the same approach this year and let you know what’s cooking in the R community, this time in even more detail.
As a small reminder, the Enterprise Applications of the R Language Conference is a cross-sector conference focusing on the commercial usage of the R programming language. If you use R in your organisation, the EARL Conference is for you and your team. Whether you’re coding, wrangling data, leading a team of R users, or making data-driven decisions, EARL offers insights you can action in your company. I am really grateful for the chance to participate for the second time on behalf of my organization. I would like to share with you 12 reasons why it was worth participating in EARL 2021. If you didn’t participate this year, I really hope my article will help you decide to join next year.
Reason 1: Seeing how NATS has developed their analytics
To be honest, I cannot imagine a better opening keynote for this edition than the presentation by Branka Subotic, the Director of Analytics at NATS. NATS is the leading Air Navigation Service Provider in the UK, handling over 2.5 million flights across UK airspace and the North Atlantic every year. Branka let us take a look behind the scenes at how her company’s analytics team was created. What is worth emphasising here is that they managed to establish a great setup from scratch with zero(!) external budget.
We were guided through plenty of well-thought-out initiatives, such as training and community development around technologies like R, Python, Power BI and Databricks. What an inspiring development strategy! I really loved the moment when, asked to compare R and Python, Branka explained that it’s impossible to choose and they’ve decided to go for both. Her teams choose the language depending on the project and the team members’ preferences. However, the structure and requirements defined for each language must be strictly followed whichever one you choose.
If you listened attentively, you had a great opportunity to take away a lot to apply in your own company’s strategy. We also had the chance to get familiar with some of NATS’s existing solutions, like the tool for forecasting benefits and performance or the Three Dimensional Insight application. Even though they’ve already achieved a lot, they still crave more. I am sure that under Branka’s leadership everything is possible: she really knows how to catch people’s attention and motivate them!
Reason 2: Learning how to perform Machine Learning in Credit Risk
Have you ever had the opportunity to try your ML skills in credit risk? If not, Eduardo Contreras Cortes from 4most showed how to do it in an accurate and interpretable way. His team achieved this using decision-tree-based ensemble models such as XGBoost and LightGBM. Their aim was to predict which customers would have payment difficulties. Interestingly, their original dataset expanded from 217 to 795 features after the feature engineering step!
I really liked their approach. Complexity was added only when needed. There was no blind trust in the model: they really cared that business, regulators and end users could understand their credit scores. One of the main requirements was that the solution be viable not only for IT teams but also for the business. Although the presentation was technical, Eduardo made it really digestible for everyone. I am sure I am not the only one with this opinion, as there were so many great questions in the Q&A session afterwards!
Reason 3: A look at how the NHS makes healthcare open by building the NHS-R Community
I really loved this one. As I focus on Citizen Data Science Community development, I found a lot of inspiration there. Chris Beeley from the NHS emphasised the role that community plays in making any industry stronger. Together with his small team, he specialises in using R and Python to collate, analyse and report on routinely collected data, in particular patient experience and clinical outcomes in community physical and mental health services. We saw some of the applications of R, Shiny and GitHub for training people to perform analytics on their own.
During his presentation, Chris focused on two main subjects: the power of open source and the importance of knowledge sharing. He and his team are proud that the NHS-R Community is absurdly cheap to run while its ROI is absurdly high. He claims that the community makes people happy and productive, changes the lives of its members and, most importantly, improves healthcare for everyone in the UK. I found a lot in common with the initiatives we are currently developing at Arla. I totally agree with the speaker that there is no better support for better analysis than investment in training and development tools.
Reason 4: Shiny applications for visualization and forecasting in the forestry supply chain
Still skeptical about using Shiny in data science? Here is another case. Tilhill Forestry is the UK market leader in forest and woodland management. The company collaborated with the University of the Highlands and Islands on a project facilitated by Innovate UK. The aim of the project was to develop a model to predict the products harvested from a forest. The team investigated the implementation of data-driven approaches to decision making.
It turned out that a Shiny app can be a wonderful choice for a data visualization tool with a great user experience. The purpose of the app was to give a visual overview of harvesting views, enable the assessment of new forests and help users get familiar with the data used in the prediction model. Teresa Marti Rosello emphasised Shiny’s advantages, such as its intuitive interface, interactivity and clear visuals.
Reason 5: Proof that network analysis is more than pretty visuals
After the first break, there were two eagerly awaited presentations on network analysis. I really loved the one given by Amit Kohli. It turned out that he is not only an expert in the topic but also the author of the ShinyTester package, which helps you debug Shiny apps while you are building them. You can check it out on Amit’s GitHub. During his presentation, he introduced us to a range of R packages useful for network analysis, such as igraph, visNetwork and tidygraph. What was great about his talk is that he didn’t leave us with just the theory but also showcased some examples.
For the first example he chose the nycflights13 dataset, well known to every R user. It was simple enough that anyone in the audience could follow it, even if it was their first contact with the topic. For more advanced users he had prepared more complex cases. Thanks to his talk we learned about community detection in graphs. We also got familiar with useful functions for finding the shortest path within a graph or using centrality to identify its key nodes. I don’t want to list all the terms and concepts the speaker introduced, but it was a really great and valuable time! Summing up, Amit proved that network analysis can not only be found everywhere but also stands for far more than just pretty pictures. It’s statistics.
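To make these ideas a bit more concrete, here is a minimal sketch of my own using igraph. It is not Amit’s code: the tiny route table is made-up data standing in for nycflights13, and the airport connections are purely illustrative. It shows a shortest path, two centrality measures and one community detection method (Louvain, my choice for the example).

```r
library(igraph)

# Hypothetical toy route table (not from the talk): one row per direct connection
routes <- data.frame(
  from = c("JFK", "JFK", "LGA", "ORD", "ORD", "DEN", "DEN", "SFO"),
  to   = c("ORD", "ATL", "ORD", "DEN", "ATL", "SFO", "LAX", "LAX")
)

# Build an undirected graph: airports are nodes, routes are edges
g <- graph_from_data_frame(routes, directed = FALSE)

# Shortest path (fewest hops) between two airports
path <- shortest_paths(g, from = "JFK", to = "LAX")$vpath[[1]]
path_names <- as_ids(path)
print(path_names)

# Centrality: degree counts direct connections,
# betweenness counts how often a node lies on shortest paths
deg <- degree(g)
btw <- betweenness(g)
print(sort(deg, decreasing = TRUE))

# Community detection with the Louvain method (results may vary between runs)
comm <- cluster_louvain(g)
groups <- membership(comm)
print(groups)
```

With real data you would build the edge list from the flights table instead of typing it by hand, and the same functions would apply unchanged.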
Reason 6: Shiny apps for air quality data analysis
Still not satisfied with the previous Shiny showcases? How about using Shiny to provide a workflow for analyzing air quality data? Adithi Upadhya is a Geospatial Data Analyst at ILK Labs in India. She gave the second presentation of the network analysis session, proving how powerful Shiny combined with R can be.
First, to set the scene, she explained how the quality measurements are performed. Her team decided to create two applications: the first to make public use of the available open source air quality data, and the second to help the team at ILK perform quality checks on high-frequency data. The applications allow multiple user inputs and can provide a unit of analysis for further study. Moreover, there is functionality to alert the user to instrument errors! All of this relies on near-real-time quality checks of the data. Just great.
Reason 7: Seeing how to use R and Shiny to prevent traffic collisions
I really loved the name of the first presentation after lunch. “Meeting citizens where they R” is such a great wordplay. I was 100% sure that it would be about the Citizen Data Science concept! I was so wrong! The presentation was about addressing the information access challenges Hong Kong citizens face in local civic participation. I laughed so hard at myself, but believe me, I will definitely use this idea for a CDS initiative! Isn’t it awesome?
The team built an R package and a Shiny app (again!) to improve access to the information mentioned above. They have organically grown an agile team of volunteers united under a shared purpose. Such a great example of harnessing the spirit of open-source, user-centred and accessible design. I really loved their energy. It was so visible how much they enjoyed working on the project. They spotted a problem affecting the whole society and treated it as an opportunity to practice, learn and create impact at scale, using R!
The R package they’ve developed is hkdatasets, available on CRAN and built to support their app development. You can see some of the package’s benefits below.
- enables faster loading in the Shiny app than from Google Sheets.
- gives better records of what has been changed in version control.
- encourages others to produce analysis and visuals based on their open-source datasets.
Not only did they build their presentation around storytelling, but they also shared lots of technical details valuable for the more experienced members of the audience.
Reason 8: Getting familiar with the data science hierarchy of needs at the Bank of England
This presentation was not technical at all, for a change. I must admit that Daniel Durling, the speaker, has a really specific sense of humor, but I liked what he presented. He proved that you can perform basic data science without a degree or years of experience. This fits perfectly with the Citizen Data Science topic I have mentioned many times today. There is so much to be gained by helping people start their data science journey. This is the essence of how professionals should act today: help people!
I really loved the data science hierarchy of needs Daniel showed us during his talk. Unfortunately, I don’t remember where he got it from, but it’s really worth applying in any organization. I cannot share the presentation slides, of course, but data collection and all its related operations formed the base of the triangle. AI, deep learning and complex ML solutions sit at the very top, which is worth keeping in mind.
Reason 9: Detection of protected characteristics bias in Machine Learning using R and Shiny
The next presentation was given by Gwilym Morrison, the Head of Analytics & Data Science at Royal London. Together with his small team, he took part, alongside the Bank of England and the Financial Conduct Authority, in the AI Public-Private Forum, which aims to ensure good regulation in finance while keeping up with innovations in ML. During the project they considered risks from three perspectives: risks to consumers, risks to firms and systemic risks.
Some of the risk examples Gwilym introduced us to were:
- The risk of power imbalance that AI applications can bring.
- Fairness and bias in machine learning
- Facial recognition bias across various ethnicity groups
Gwilym’s team built a Shiny app that predicts whether they should accept or decline a customer. They analyzed it deeply to investigate all possible biases. The app is great because it tells you where a bias might be coming from. They spotted that the bias grows for Asian people in their application.
The most valuable conclusion from Gwilym’s talk is that machine learning is always deeply connected with discrimination. That is what models do, but it’s our task to check whether the discrimination we have obtained is required or whether it is a bias. There are also biases that are controversial but true. We need to spot the ones coming directly from the data and prevent unfair behavior. It’s a real challenge. I really loved Gwilym’s presentation, as it emphasises the importance of our work and gives a direction that not only Royal London but every data scientist in the world should follow.
Reason 10: Great tips on how to make R packages part of your team
The last presentation before the keynote, given by Emily Riederer from Capital One, was about how organizations can use R effectively. She claims that an internally developed R package can be as helpful as a new colleague. Compared to open source packages, internal ones are far more specific and concrete in their problem definition, because they know the questions and issues our organization faces.
To better understand the benefits, let’s go through the ideas Emily presented. Internal packages can be:
- utility packages – enabling data access, server connections, proxies, SSH or SSL
- analysis packages – right problems, tribal knowledge, intuition, e.g. curated workflows, tailored function calls, automated results generation
- developer tools – team norms, meetings and communication, e.g. color palettes, Shiny modules or git hooks.
She didn’t leave us with just the theory but backed it up with examples such as the get_database_conn and viz_cohort functions, which have been customized for the organization’s needs. Comparing R packages to people (the IT Guy, the Junior/Trainee and the Tech Lead) was a really wonderful way of sharing knowledge! This presentation will surely stay in my mind forever as an example of wonderful storytelling.
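To illustrate what a function in a utility package might look like, here is a purely hypothetical sketch in base R. It is not Emily’s or Capital One’s code: the function name get_database_conn_args, the environment variables, the host and the port are all my own assumptions. The idea is simply that the package remembers the team’s connection settings so that analysts don’t have to.

```r
# Hypothetical internal helper: every name and default below is an illustrative
# assumption, not the actual function shown in the talk.
get_database_conn_args <- function(database = "analytics",
                                   host = Sys.getenv("TEAM_DB_HOST", "db.internal.example"),
                                   user = Sys.getenv("TEAM_DB_USER")) {
  # Fail early with an actionable message instead of a cryptic driver error
  if (!nzchar(user)) {
    stop("TEAM_DB_USER is not set; add it to your .Renviron file.", call. = FALSE)
  }
  # Team-wide defaults live in one place instead of in every script
  list(host = host, dbname = database, user = user, port = 5432L)
}

# Example usage: in a real package these arguments would be passed on to a
# database driver; here we only inspect them.
Sys.setenv(TEAM_DB_USER = "analyst")
conn_args <- get_database_conn_args()
str(conn_args)
```

The benefit is that a change in infrastructure, such as a new host, only needs to be made once, inside the package.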
Reason 11: Keynote by Dr. Jacqueline Nolis
Although this presentation was the last one, it was definitely something I had been waiting for from the very beginning. From the moment I saw this item on the EARL 2021 agenda, I knew that I needed to participate. Dr. Jacqueline Nolis is one of my biggest inspirations in the data science world. I have read her book, written together with my second role model, Emily Robinson, and it’s one of my favourites. I admire both not only for their experience but also for the excitement they share about data science. It’s visible at first sight that they love what they do!
What was beautiful about Dr. Nolis’ presentation was that she not only walked us through specific types of risk in data science but also shared her real-life experiences with projects that succeeded and projects that failed. As data scientists we need to face an additional aspect of risk: sometimes a promising project is simply not feasible!
We need to remember that even if we fail, it doesn’t mean that it’s always our fault – especially in data science. The project can be cancelled for various reasons and we cannot predict everything. Deciding whether to start a project is like gambling. Even if something seems to be a good idea, it can become impossible at some point. Poor Product Owners!
Summing up, Dr. Nolis left me with the thought that not everything we do in the future will be successful. It doesn’t matter how hard you try, sometimes it’s just out of your control. The only thing you can do is learn from both the ups and the downs. The more pitfalls we overcome, the better we are at managing hard situations.
Reason 12: Wonderful workshops!
EARL is not only about inspiring presentations. It also includes great workshops. This year Mango Solutions prepared four sessions for the participants:
- Introduction to Shiny
- Package Development in R
- Functional Programming with Purrr
- Web Scraping and Text Mining Lyrics in R
I participated in the last session. Daniel de Bortoli and Andrew Little from Mango showed us how a data scientist can generate their own datasets on any topic of interest. In this workshop, they demonstrated a full text analysis workflow in R using lyrics from the most popular songs on Spotify. First we used the Spotify API to extract data on top songs. It wasn’t a new topic for me, as I had already performed a similar task for my own experiment at the beginning of the year for this article. However, it was totally worth revising the knowledge and updating it with more details and tips. Combined with CSS and HTML skills, it’s really huge!
Then we learned how to scrape lyrics from the web for our selected songs. We covered not only web scraping but also cleaning and pre-processing text data, topic modelling, word embeddings and sentiment analysis. All of this finished with supervised learning to check whether we could predict musical genre from lyrics.
I am really happy that the session was recorded and that I can access the materials, as the pace of the sessions was really fast! Daniel and Andrew did a really great job on their project, and it only convinced me further of the possibilities offered by machine learning in R.
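To give a flavour of the cleaning and sentiment steps, here is a minimal base R sketch of my own. It is not the workshop material: the two lyric lines are made up, and the five-word lexicon is a toy assumption (a real analysis would use a published sentiment lexicon and a proper text mining package).

```r
# Made-up lines standing in for scraped lyrics (hypothetical data)
lyrics <- c(
  "I love the summer, love the sun!",
  "Sad rain, cold night... I cry alone"
)

# Tiny illustrative sentiment lexicon: +1 for positive words, -1 for negative ones
lexicon <- c(love = 1, sun = 1, sad = -1, cold = -1, cry = -1)

# Cleaning and pre-processing: lowercase the text, replace everything except
# letters, apostrophes and spaces with a space, then split into words
tokenize <- function(text) {
  cleaned <- gsub("[^a-z' ]", " ", tolower(text))
  words <- strsplit(cleaned, " +")[[1]]
  words[nzchar(words)]
}

# Sentiment score: sum the lexicon values of the words found in the text
# (words missing from the lexicon yield NA and are ignored)
score_sentiment <- function(text) {
  words <- tokenize(text)
  sum(lexicon[words], na.rm = TRUE)
}

tokens <- lapply(lyrics, tokenize)
scores <- vapply(lyrics, score_sentiment, numeric(1), USE.NAMES = FALSE)
word_counts <- sort(table(unlist(tokens)), decreasing = TRUE)

print(scores)
print(head(word_counts))
```

Here the first line scores positive and the second negative, which is the kind of signal that can later feed into a genre classifier.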
As you can see, I really enjoyed EARL 2021. What makes me even happier is being able to compare my attitude and knowledge level at the event last year and today. There is still a lot to learn and I am still a newbie, but the feeling of all the puzzle pieces in my head coming together with every month of hard work is the best feeling ever. I have gained an enormous amount of inspiration for my next projects and ideas.
I am thankful to my company for letting me participate in EARL for the second time. Who knows, maybe it will be third time lucky and I will join as one of the speakers next time? I know it’s ambitious, but I would really love to challenge myself and see how the conference looks from the other side. Please keep your fingers crossed for me, as I really am on a mission to pass on valuable things to the world, and this industry definitely makes it possible. 🙂