BLOG@CACM
Computing Profession

The Israeli Social Protest from a Data Science Perspective: Part Two

Technion faculty member Orit Hazzan

Prologue

This post and the previous one were written before October 7th, the day on which Israel's war on Hamas broke out, the day on which reality in Israel changed, and the social protest came to a grinding halt. Nevertheless, these contents are relevant even now, and a great deal may be learned from them for the future as well.

With deep sorrow, I dedicate these two posts to all of the people who were murdered, killed, kidnapped, or are still missing.

I also dedicate them to the brave people of Israel who are fighting for its future as a democratic and Jewish state, based on the values set out in our Declaration of Independence. See, for example, the War Room opened to track kidnapped and missing people (including children) after Hamas' attack. It is led by Professor Karine Nahon and run entirely by volunteers, many of whom have been organizing the social pro-democracy protests in Israel's streets.

To maintain the spirit of the social protest, I have left the two posts in the present simple tense.

Introduction

In my previous post, I examined the social protest currently taking place in Israel from the data science perspective, highlighting the challenges of data. In this post, I continue that examination, highlighting the challenges that arise in the different steps of the data science process. In addition to lessons that can be learned for social protests (here, in Israel, and elsewhere in the world), I also propose that the data gathered on this protest be used for data science education purposes, as it raises many issues that are often addressed (sometimes only theoretically) in data science programs.

The challenge of the data science process

Data collection

For the past nine months, since January 2023 (until October 7, 2023), every Saturday night, about a quarter of a million of Israel's approximately 10 million citizens (the equivalent of 9 million people in the U.S.) gather to protest against the changes in the regime system that the current government is trying to implement. The demonstrations take place at about 150 locations all over Israel: at the centers of cities, on bridges over highways, and at major junctions.

In addition to the Saturday night protests, the protest takes place on a daily basis in a variety of places and forms. A great deal of data is created in all of these events that, for the sake of documentation, should be gathered and stored in real time to create a database whose analysis can be used to design the continuation of the protest, as well as to derive lessons for other protests in other places around the world.

Some of the main questions we should address in this context are: What data should be gathered in real time (for example, WhatsApp posts that might be erased)? What data should be collected retrospectively (e.g., people's reflections on their experiences in a specific event)? What data should be collected over a period of time on a regular basis (e.g., feelings that may change during the period of the protest)?

Automation of data collection should also be addressed, since much of the data can be collected and analyzed automatically in real time. For example, think about an automatic process that transcribes the speeches and transfers the data, in real time, to an algorithm that models the relationship between the content of the speeches, the events that took place during the previous week, and the talkbacks that appeared in major Israeli media channels during that week. Such a model can, in turn, predict agreement and clashes between the messages delivered in the protest speeches and the talkbacks published during the week.

This approach can be expanded to many elements of the protest and can automatically create a huge database that can be analyzed in real time, providing important insights for the protest leadership.
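As a very rough illustration of one small piece of such a pipeline, the sketch below ranks talkbacks by how much vocabulary they share with a speech transcript, using bag-of-words cosine similarity. It is only a sketch under strong assumptions: the transcript is assumed to exist already as text, the function names (bag_of_words, cosine_similarity, rank_talkbacks) and the sample texts are invented for illustration, and word overlap is a crude proxy that cannot by itself distinguish agreement from a clash.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_talkbacks(speech: str, talkbacks: list[str]) -> list[tuple[float, str]]:
    """Return (score, talkback) pairs sorted from most to least similar to the speech."""
    speech_vec = bag_of_words(speech)
    scored = [(cosine_similarity(speech_vec, bag_of_words(t)), t) for t in talkbacks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Invented placeholder texts, not real protest data.
speech = "we protest to protect the independence of the courts"
talkbacks = [
    "traffic was blocked again on saturday night",
    "the courts must stay independent, keep protesting",
]
ranked = rank_talkbacks(speech, talkbacks)
```

A realistic system would need speech-to-text, Hebrew-aware language processing, and a model trained to separate supportive from opposing talkbacks; the similarity score here only indicates that two texts discuss overlapping topics.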

Data cleansing

It is often estimated that data scientists spend 80% to 90% of their time on data cleaning. This also seems to be the case with the data gathered in the Israeli social protest. For example, the numerical data mentioned in the data versatility section of my previous post, which presents the number of protesters at each location in Israel during the Saturday night demonstrations, requires a great deal of cleaning: eliminating spelling mistakes in location names, unifying the different ways in which location names are written, completing missing data, and more.
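The kind of cleaning described above might be sketched as follows, using fuzzy string matching from Python's standard library. The canonical list, the 0.8 similarity cutoff, the function names, and the sample rows are all assumptions made for illustration; they are not taken from the actual protest data.

```python
import difflib

# Hypothetical canonical spellings; a real list would come from the protest organizers.
CANONICAL_LOCATIONS = ["Tel Aviv", "Haifa", "Jerusalem", "Beer Sheva"]


def normalize_location(raw, canonical=CANONICAL_LOCATIONS, cutoff=0.8):
    """Map a raw location string to its closest canonical name, or None if none is close."""
    # Unify hyphens, extra whitespace, and letter case before matching.
    cleaned = " ".join(raw.replace("-", " ").split()).title()
    matches = difflib.get_close_matches(cleaned, canonical, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def clean_records(rows):
    """Split rows into cleaned (name, count) pairs and unresolved raw names.

    A missing or non-numeric count is kept as None so it can be completed later.
    """
    cleaned, unresolved = [], []
    for raw_name, raw_count in rows:
        name = normalize_location(raw_name)
        if name is None:
            unresolved.append(raw_name)  # leave for manual review
            continue
        text = str(raw_count).strip()
        cleaned.append((name, int(text) if text.isdigit() else None))
    return cleaned, unresolved


# Invented sample rows, not real counts.
rows = [("tel-aviv", "120000"), ("  HAIFA ", "15000"), ("Jerusalm", ""), ("somewhere", "40")]
cleaned, unresolved = clean_records(rows)
```

Keeping unresolved names in a separate list, rather than guessing, reflects the point that cleaning such data requires knowledge of the context: a person familiar with the protest locations would decide how to handle them.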

Exploratory data analysis   

Many questions can be asked about the data gathered in the demonstrations that can be addressed in the exploratory data analysis (EDA) stage to identify general patterns in the data, prior to the deeper data analysis step in which suitable analysis methods are selected and designed to yield meaningful models.

We demonstrate the application of EDA to the data that describes the number of protesters in the Saturday night demonstrations across all locations throughout the 40 weeks of the protest. Even this simple data can be explored from different perspectives and by posing different questions, one of which is how to cluster the locations:

  • By number of protesters:
    • by the magnitude of the number of protesters (a few/dozens/hundreds/thousands/tens of thousands/etc.)?
    • by the cumulative number of protesters?
    • by the average number of protesters?
  • By characteristics of the locations:
    • by the kind of location (city centers, bridges, junctions)?
    • by the number of Saturday nights that demonstrations took place in these locations (every Saturday night? only once? twice? when?)?
  • By times:
    • by the frequency of demonstrations in places where demonstrations did not take place every weekend?
    • by the dates on which demonstrations took place at these locations? And so on…
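Several of the groupings listed above can be computed from the same simple records. The following is a minimal sketch, assuming each record is a (location, date, count) tuple; the function name summarize_locations, the feature names, and the sample records are invented for illustration only.

```python
from collections import defaultdict


def summarize_locations(records):
    """Aggregate (location, date, count) records into per-location summary features."""
    counts = defaultdict(list)
    for location, _date, count in records:
        counts[location].append(count)

    summary = {}
    for location, values in counts.items():
        mean = sum(values) / len(values)
        summary[location] = {
            "weeks_active": len(values),   # number of Saturday nights with a demonstration
            "cumulative": sum(values),     # total protesters over all weeks
            "average": mean,               # average protesters per demonstration
            # Order of magnitude of the average: 2 = hundreds, 3 = thousands, and so on.
            "magnitude": len(str(int(mean))) - 1,
        }
    return summary


# Invented sample records, not real counts.
records = [
    ("Tel Aviv", "2023-03-04", 150000),
    ("Tel Aviv", "2023-03-11", 170000),
    ("Haifa", "2023-03-04", 20000),
    ("Junction A", "2023-03-11", 300),
]
summary = summarize_locations(records)

# Group locations by the order of magnitude of their average turnout.
by_magnitude = defaultdict(list)
for location, features in summary.items():
    by_magnitude[features["magnitude"]].append(location)
```

Each feature corresponds to one possible clustering criterion; which of them yields meaningful clusters is exactly the kind of question the EDA stage is meant to explore.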

As part of this step, the data organization can be refined and, as preliminary insights are revealed, the decision of which features to include for each location can be updated. For example, while it is clear that data about the location, date, and number of protesters should be included, we may ask whether data about the speakers, the text of the speeches, and the events that occurred prior to the demonstration date is relevant. And what data should be included about the speakers: only their names, roles, gender, and age? Any additional details? Such questions should continue to be asked to enable a meaningful data analysis process and the construction of a meaningful model.

Clearly, such an EDA process requires an understanding of the context and can, therefore, serve as a meaningful and interesting exercise for students that demonstrates not only the importance of this step in the data science process, but also the importance of the application domain.

Data analysis  

After the data is gathered, cleaned, and organized, it is analyzed. Here, again, the variety of relevant data analysis methods is impressive. NLP and image analysis are only two of the relevant methods: the former for studying the speeches, and the latter for studying the photos and video clips collected in the demonstrations.

This step of the data science process serves also as a wonderful opportunity to highlight the importance of context for the construction of a meaningful model out of this huge mountain of data. Specifically, we must keep in mind the application domain and constantly strive to overcome the domain neglect cognitive bias that may lead to wrong data interpretations (Mike and Hazzan, 2023).

Conclusion

As can be seen, the Israeli social protest can also be examined from the data science perspective. Clearly, not all aspects of the data science process have been discussed in this post or in the previous one; ethics, for example, has not been addressed. If the protest continues after the war, more data will be created (either of the same kind or not) and new insights on the data science process will be gained.

This historic protest will undoubtedly be included in Israeli social science study programs, and so I propose that it be integrated into Israeli data science study programs as well, and specifically into the computer science component of these programs, since computer science deals with algorithms that are, in many cases, illustrated with invented data.

Acknowledgement

I would like to thank Dr. Orna Agmon Ben-Yehuda, Dr. Avital Binah Pollak, Dr. Yael Rivka Kaplan, Ronit Lis Hacohen, Dr. Koby Mike, Professor Karine Nahon, Mr. Yair Palti, Professor Sivan Toledo, and Dr. Uri Zakai for their collaboration and thoughts on various issues concerning the collection and analysis of data on the current Israeli social protest.

Reference

Mike, K. and Hazzan, O. (2023). What Is Common to Transportation and Health in Machine Learning Education? The Domain Neglect Bias, IEEE Transactions on Education 66(3), pp. 226-233, doi: 10.1109/TE.2022.3218013.

 

Orit Hazzan is a professor at the Technion's Department of Education in Science and Technology. Her research focuses on computer science, software engineering, and data science education. For additional details, see https://orithazzan.net.technion.ac.il/.
