
Methods
Data Sources
UNHCR
We gathered data from a wide range of reputable sources. For raw data on refugee movement, such as total refugee populations and monthly asylum applications, the UNHCR provides openly available datasets for analysis. Gathering UNHCR data was only a matter of downloading datasets from the UNHCR website.
The datasets we used were Persons of Concern and Asylum-Seekers (Monthly Data).
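As a rough illustration, loading the downloaded datasets into pandas looked something like the sketch below (the file names are placeholders for whatever the exported CSVs were called):

import pandas as pd

# Load the two UNHCR exports (file names are illustrative placeholders;
# adjust skiprows if the export includes header notes)
persons_of_concern = pd.read_csv('unhcr_persons_of_concern.csv')
asylum_monthly = pd.read_csv('unhcr_asylum_seekers_monthly.csv')

# Quick sanity check of what each table contains
print(persons_of_concern.head())
print(asylum_monthly.head())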
News articles
Initially, we wanted to focus on five popular UK magazines associated with political parties at opposite ends of the politico-economic spectrum. We carried out initial web scraping using The Guardian's API, aiming to do the same for the rest of the chosen publications. We obtained JSON files with links to articles containing the word "refugee" published in the magazine in 2017. However, when parsing the files to collect all of the links into one file ready for further parsing, we realised that this method was too time-consuming and would significantly slow our progress, purely due to the run time of the code we produced.
Chiara suggested we use Lexis®Library, a database of news and business information from a range of sources, including UK national and regional newspapers. We restricted our search to articles from UK magazines containing the word "refugees" in the body text and published between 1 January 2017 and 31 December 2017. The articles also had to be longer than 500 words. One limitation of this database is that it does not allow filtering out duplicates (it only allows grouping them), and another is that it only searches a thousand articles at once. To bypass the latter, we gathered the articles in two sets, one for each half of the year. We then encountered a further limitation: the database only allows 500 articles to be downloaded at once, so we split each set in two again. This gave us articles in four HTML files ready to be scraped.
Unfortunately, writing the scraper was beyond our knowledge of Python, so we consulted our professor, Steven Gray, for help. He wrote the majority of this code, but the articles needed further cleaning, which had to be done manually by searching for each item that needed to be replaced, fixed, or deleted.
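The snippet below is not the code Steven Gray helped us write, only a minimal sketch of the general approach; the tag and class names are hypothetical stand-ins for whatever the Lexis®Library HTML actually uses:

from bs4 import BeautifulSoup
import pandas as pd

records = []
for path in ['articles_1.html', 'articles_2.html', 'articles_3.html', 'articles_4.html']:
    with open(path, encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')
    # 'div.document' and the selectors below are hypothetical placeholders
    for doc in soup.select('div.document'):
        records.append({
            'Publication': doc.select_one('div.source').get_text(strip=True),
            'Date': doc.select_one('div.date').get_text(strip=True),
            'Text': doc.select_one('div.body').get_text(' ', strip=True),
        })

articles = pd.DataFrame(records)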
A limitation of the database that we were not able to overcome was the presence of duplicate articles in the dataset. Attempts at dropping duplicates based on the article text using pandas resulted in some articles being deleted from the data frame entirely, while other duplicated articles remained. You can find the full explanation of what was happening here.
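For reference, this is the kind of de-duplication we attempted, assuming the article text sits in a column called 'Text':

# Attempted de-duplication on the article text column
before = len(articles)
articles = articles.drop_duplicates(subset=['Text'], keep='first')
print('Dropped', before - len(articles), 'rows')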
For social media sentiment analysis, we focused exclusively on Twitter data. We consider tweets from public accounts to be public knowledge and therefore not subject to privacy concerns. We also considered Twitter discourse the most appropriate to analyse, as discussions are not separated into posts and comments; each expression is a post (a tweet) in itself. This made gathering the sentiment of Twitter users convenient.
When scraping tweets, we found that using Twitter's API with the tweepy Python module was far too limiting. Twitter restricts the period for which tweets can be retrieved through its API to the previous week, and as we needed tweets from a few specific days in 2017, we could not use it.
Instead, we used a Python program called Get Old Tweets Programmatically, written by Jefferson Henrique, which retrieves tweets older than a week and bypasses some other limitations of the Twitter API. More specifically, this program accepts tweet attributes and uses Twitter advanced search (example here) to query and return tweets. Conveniently, Henrique also included an 'exporter' application which writes all tweets into one neat CSV file.
We made minor changes to this exporter application, including adding quotation marks around each field and changing the column separator in order to minimise data loss, since many tweets contain quotation marks and commas, which would otherwise break the CSV parsing.
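The exact edit depended on Henrique's exporter code, so the snippet below is only a sketch of the settings we were aiming for: quote every field and use a different separator so that commas and quotation marks inside tweets do not break the columns.

import csv

# Illustrative CSV settings: quote all fields and use ';' as the separator
with open('tweets.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, delimiter=';', quoting=csv.QUOTE_ALL)
    writer.writerow(['username', 'date', 'text'])
    writer.writerow(['example_user', '2017-06-03 21:00', 'An "example" tweet, with a comma'])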
As we were interested in the dates around three terrorist attacks in the UK, we ran the exporter application in the terminal with attributes for the dates, the search term ('refugee'), and the location (in England). Here is an example:
python Exporter.py --since 2017-06-02 --until 2017-06-05 --querysearch "refugee" --near England --output LondonBridgeAttack.csv
Disclaimer: we are unsure whether the attribute --near England really worked, as the returned tweets had no geo-data and some seemed suspiciously foreign. However, looking at the histograms of tweets, we consistently see a dip to a minimum after midnight (BST) and a rise again at around 10am (BST), as if following a sleeping pattern. We therefore make the cautious assumption that the tweets are indeed only from England.
Methodology of Sentiment Analysis of articles and tweets
We elected to use VADER (Valence Aware Dictionary and sEntiment Reasoner) to conduct our sentiment analysis. Full details of the methodology can be found at https://github.com/cjhutto/vaderSentiment.
Essentially, VADER is a lexicon- and rule-based sentiment analyser. It captures both the polarity (positive/negative) and the intensity of a text. It computes a 'compound' score by summing the valence score of each word in the lexicon, adjusting it according to its rules, and normalising the result to a value between -1 and +1 (extreme negative to extreme positive). A score between -0.05 and +0.05 indicates a neutral sentiment.
Here is an example of a few tweets and their calculated VADER sentiment scores:
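To give a flavour of what this looks like in code, the snippet below scores two made-up example sentences (they are ours, not tweets from our dataset):

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyser = SentimentIntensityAnalyzer()

examples = [
    'Refugees are welcome here, what a wonderful show of support!',
    'This crisis is terrible and nobody is helping.',
]
for sentence in examples:
    scores = analyser.polarity_scores(sentence)
    # 'compound' is the normalised score between -1 and +1
    print(sentence, '->', scores['compound'])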

The same code was run for both articles and tweets; it can be seen in the snapshot from the Jupyter Notebook below.
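In essence, that code applies VADER's compound score to the text column of each data frame, along the lines of this sketch (the data frame and column names are illustrative):

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyser = SentimentIntensityAnalyzer()

def compound_score(text):
    # VADER's normalised compound score, between -1 and +1
    return analyser.polarity_scores(str(text))['compound']

articles['Sentiment'] = articles['Text'].apply(compound_score)
tweets['Sentiment'] = tweets['text'].apply(compound_score)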

The sentiment scores were then categorised, according to the rules explained above, as strongly negative, slightly negative, neutral, slightly positive, or strongly positive. The code for this can be found below. Finally, the new data frame was saved as a CSV file for further analysis.
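That categorisation boils down to a simple set of thresholds on the compound score; here is a sketch (the ±0.5 cut-off between 'slightly' and 'strongly' is an illustrative choice):

def describe(score):
    # ±0.05 separates neutral from non-neutral, as in the VADER documentation;
    # the ±0.5 boundary between 'slightly' and 'strongly' is illustrative
    if score <= -0.5:
        return 'strongly negative'
    elif score < -0.05:
        return 'slightly negative'
    elif score <= 0.05:
        return 'neutral'
    elif score < 0.5:
        return 'slightly positive'
    else:
        return 'strongly positive'

articles['Description'] = articles['Sentiment'].apply(describe)
articles.to_csv('articles_with_sentiment.csv', index=False)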

Visualisation Techniques
Article Sentiment Visualisations
The Jupyter Notebook with the step-by-step process of creating these visualisations can be found here, under 3. Sentiment analysis on media articles/3.1 Visualising media articles.ipynb.
Bar chart of the number of articles in each sentiment category
We visualised with Plotly due to its interactivity and because of how feature-complete the library is. The official Plotly guide for bar charts can be found here: https://plot.ly/python/bar-charts/
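A minimal sketch of the kind of Plotly call involved, using the offline mode and the data frame from the earlier snippets:

import plotly.graph_objs as go
import plotly.offline as py

counts = articles['Description'].value_counts()

bar = go.Bar(x=counts.index, y=counts.values)
layout = go.Layout(title='Number of articles by sentiment category',
                   yaxis=dict(title='Number of articles'))
py.plot(go.Figure(data=[bar], layout=layout), filename='article_sentiment_bar.html')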

Time series of average sentiment scores given to articles
Initially, we plotted a time series of every individual sentiment score given to an article, but this graph proved to be messy and did not say anything meaningful about the data. The reason is that several articles could be published on the same date, usually with different sentiment scores, so the line jumped between values within a single day. A way to make the graph clearer is to take the average sentiment score for each day. The code for the average sentiment score and the resulting visualisation can be seen below.
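In outline, the daily averaging and the time series plot look like this (assuming the 'Date' column can be parsed as a datetime):

import pandas as pd
import plotly.graph_objs as go
import plotly.offline as py

articles['Date'] = pd.to_datetime(articles['Date'])

# Average the compound scores of all articles published on the same day
daily_mean = articles.groupby(articles['Date'].dt.date)['Sentiment'].mean()

trace = go.Scatter(x=list(daily_mean.index), y=daily_mean.values, mode='lines')
layout = go.Layout(title='Average article sentiment per day',
                   yaxis=dict(title='Mean compound score'))
py.plot(go.Figure(data=[trace], layout=layout), filename='daily_sentiment.html')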


Bar chart of the number of articles published by each magazine, with magazines selected based on their frequency in the dataset
This was a pretty straightforward graph to make. It required creating a new data frame with the magazines that appeared in our data set most frequently:
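In outline, the new data frame is just the value counts of the publication column (the cut-off of ten magazines is illustrative):

# Count how many articles each magazine contributed and keep the most frequent ones
magazine_counts = articles['Publication'].value_counts().head(10).to_frame(name='Articles')
print(magazine_counts)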

A bar chart was then created, again using plot.ly:
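The Plotly call mirrors the earlier bar chart, just with the magazine counts:

import plotly.graph_objs as go
import plotly.offline as py

bar = go.Bar(x=magazine_counts.index, y=magazine_counts['Articles'])
py.plot(go.Figure(data=[bar]), filename='articles_per_magazine.html')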

Time series of the number of articles published by a publication
This was a trickier graph to make. New data frames were needed with the dates as the index and, for each magazine, the number of articles published on each date. Some magazines did not publish on some days, so this had to be done by first creating a separate data frame for each magazine, then joining these data frames and filling the NaNs with 0s. One obstacle we encountered was that "value_counts" did not work on our "datetime" dates, so all of the dates needed to be converted into strings using "applymap(str)", as found here.
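A sketch of that preparation (we used applymap(str) in the notebook; astype(str) below does the same job for a single column):

import pandas as pd

# Dates as strings so that value_counts behaves as expected
articles['DateStr'] = articles['Date'].astype(str)

per_magazine = []
for magazine in magazine_counts.index:
    subset = articles[articles['Publication'] == magazine]
    # One column per magazine: number of articles on each date
    per_magazine.append(subset['DateStr'].value_counts().to_frame(name=magazine))

# Join on the date index; days with no articles become NaN, then 0
daily_by_magazine = per_magazine[0].join(per_magazine[1:], how='outer').fillna(0).sort_index()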

This was then put on a graph using plot.ly:
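And the corresponding Plotly figure, with one trace per magazine:

import plotly.graph_objs as go
import plotly.offline as py

traces = [go.Scatter(x=daily_by_magazine.index, y=daily_by_magazine[m], mode='lines', name=m)
          for m in daily_by_magazine.columns]
layout = go.Layout(title='Articles per day by publication',
                   yaxis=dict(title='Number of articles'))
py.plot(go.Figure(data=traces, layout=layout), filename='articles_per_day_by_magazine.html')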

Percentage distribution of sentiment scores for each of the magazines
Firstly, new data frames with the percentage distribution of sentiment categories ("Description") were created. Here's the code with a sample output:
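In essence, that calculation is a normalised value_counts per magazine, along these lines:

import pandas as pd

distributions = {}
for magazine in magazine_counts.index:
    subset = articles[articles['Publication'] == magazine]
    # Share of each sentiment category ('Description') as a percentage
    distributions[magazine] = subset['Description'].value_counts(normalize=True) * 100

percentages = pd.DataFrame(distributions).fillna(0)
print(percentages.round(1))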

Then, these were visualised using a stacked bar chart in plot.ly:
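The stacked bar chart then takes one trace per sentiment category, with barmode set to 'stack':

import plotly.graph_objs as go
import plotly.offline as py

traces = [go.Bar(x=percentages.columns, y=percentages.loc[category], name=category)
          for category in percentages.index]
layout = go.Layout(barmode='stack', title='Sentiment distribution per magazine (%)')
py.plot(go.Figure(data=traces, layout=layout), filename='sentiment_by_magazine.html')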


Twitter Sentiment Visualisations
Stacked histograms of Twitter sentiment of refugees
The 'stacked histogram' is actually a stacked bar chart in disguise. To create it, we had to build four datasets for the four bands of sentiment (slightly/strongly positive/negative), each splitting its tweets into 10-minute intervals and counting the number of tweets inside each interval. Here is the code and output:
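A sketch of the interval counting, assuming the tweets data frame has a datetime column 'date' and the 'Description' column from the sentiment step:

import pandas as pd

tweets['date'] = pd.to_datetime(tweets['date'])
tweets = tweets.set_index('date')

bands = ['strongly negative', 'slightly negative', 'slightly positive', 'strongly positive']
interval_counts = {}
for band in bands:
    subset = tweets[tweets['Description'] == band]
    # Count tweets in each 10-minute interval
    interval_counts[band] = subset.resample('10T').size()

interval_counts = pd.DataFrame(interval_counts).fillna(0)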



Again, we used plot.ly for the visualisation. Here is our code for visualising:
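The 'stacked histogram' itself is, once more, a Plotly stacked bar chart with one trace per band:

import plotly.graph_objs as go
import plotly.offline as py

traces = [go.Bar(x=interval_counts.index, y=interval_counts[band], name=band)
          for band in interval_counts.columns]
layout = go.Layout(barmode='stack', title='Tweet sentiment in 10-minute intervals')
py.plot(go.Figure(data=traces, layout=layout), filename='tweet_sentiment_histogram.html')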



This process of data preparation and visualisation was repeated for each selected terrorist attack. It could not be done in a Python 'for' loop because it involves cleaning different dataframes, but simply using Find and Replace All made repeating the process far simpler.
Natural Language Toolkit Lexical Dispersion Plot
Python's NLTK provides a few useful NLP visualisations. We chose a lexical dispersion plot because it works well with time series or chronologically ordered data.
We first selected 16 words to visualise and analyse. We decided on: asylum, seekers, flood, wave, influx, swarm, welcome, migrants, immigrants, illegal, conflict, terrorist, terrorists, terrorism, and two geographical words related to each terrorist incident (e.g. for the London Bridge attack, we visualised 'london' and 'bridge').
We decided on these words after research showed that these are the keywords that tend to appear often in discourse about refugees.
For the lexical dispersion plots, we had to create nine plots: three for each terrorist attack, one for each of the three sentiment bands (positive, negative, neutral).
Here, we show how we prepared the data for a lexical dispersion plot.
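In outline, the preparation means turning the tweet texts into one long token sequence, in chronological order, and fixing the list of target words (the tokenisation here is a simple lowercase split):

import nltk

# Target words, including the two place names for the London Bridge attack
target_words = ['asylum', 'seekers', 'flood', 'wave', 'influx', 'swarm', 'welcome',
                'migrants', 'immigrants', 'illegal', 'conflict', 'terrorist',
                'terrorists', 'terrorism', 'london', 'bridge']

# Concatenate the tweets in chronological order and tokenise
tokens = ' '.join(tweets['text'].astype(str)).lower().split()
text = nltk.Text(tokens)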


This code creates a dispersion plot for the neutral-sentiment tweets around the London Bridge attack (stored as dataframe df1):
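A sketch of that step, reusing target_words from the previous snippet (NLTK draws the dispersion plot via matplotlib):

import nltk
import matplotlib.pyplot as plt

# Keep only the neutral-sentiment tweets from the London Bridge data frame
neutral = df1[df1['Description'] == 'neutral']

tokens = ' '.join(neutral['text'].astype(str)).lower().split()
nltk.Text(tokens).dispersion_plot(target_words)
plt.show()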


This code is repeated for the remaining sentiment bands (positive and negative), and then again for the other two terrorist attacks.
Word clouds
The word cloud visualisations were carried out in PyCharm, as we experienced trouble importing the library into the Jupyter Notebook. The words used for creating the dispersion plots were saved as text files and then imported into PyCharm. Running the following code for the first time showed that this file was still not clean, as words like "youtube", "twitter", "com", etc. appeared in the word cloud. Because simply replacing some of these words/expressions would also have changed longer words that contain them, the cleaning of the text file was done manually. The full code, along with the files, can be found here.
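For reference, a minimal version of the word cloud code we ran in PyCharm (the file name is illustrative):

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

with open('london_bridge_tweets.txt', encoding='utf-8') as f:
    text = f.read()

cloud = WordCloud(stopwords=STOPWORDS, background_color='white',
                  width=800, height=400).generate(text)

plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()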

Refugee Crisis Visualisations
Tableau Map Plot
The Tableau Map was created using this resource: https://www.youtube.com/watch?v=ckQNNhCfUW4 .
Plotly annotated heatmaps
The annotated heatmaps were created using this resource: https://plot.ly/python/heatmaps/
Bokeh Map Plot
The Bokeh map was created using this resource: https://www.youtube.com/watch?v=P60qokxPPZc
Folium Map Plot
The Folium map was created using this resource: https://blog.prototypr.io/interactive-maps-with-python-part-1-aa1563dbe5a9
Alternative Visualisations (Shortcomings)
There are some alternative visualisations that we considered implementing fully but encountered limitations or issues.
Map visualisations
The original map visualisation seen on the first page was created in Tableau Desktop and published to Tableau Public. The reason for this is that Tableau is a far more complete data visualisation tool than any Python visualisation library we tried, and we say this after many hours of trialling different libraries.
Bokeh
To begin with, we tried implementing a map with the Bokeh library. Specifically, we chose to plot the map on a Google Map (GMapPlot) to give the user the interactivity of scrolling through the map, which allows country names to be hidden until a certain zoom level is reached. A GMapPlot also appeared to be the only Bokeh method that accepts longitude-latitude co-ordinates. This was preferred over the alternative Bokeh approach of converting co-ordinates into positive x-y co-ordinates and plotting them over a static image, which offers no real interactivity.
However, Google Maps places many restrictions on GMapPlot. At the very least, the height and width of the map cannot be set; instead it plots a fixed square map that is incredibly awkward to navigate. Another issue with Bokeh's GMapPlot was that the plotted points are not fixed in place when a user pans the map, although this is more of an aesthetic shortcoming.
Perhaps the biggest issue with GMapPlot is that scrolling to the Pacific Ocean and crossing the boundary where longitude flips from positive to negative (the antimeridian) breaks all of the plotted points whenever that section of the world is in view. There did not seem to be a fix for this, despite our efforts to implement a pan boundary or a minimum zoom limit.
Another problem with Bokeh (or rather, an incredible advantage of Tableau) is the ability to filter and query information from a data frame interactively, which Bokeh lacks.
Here is our honest attempt at Bokeh in all its glory:
Origin of refugees arriving in the UK, 2017
with Bokeh
Folium
Folium was a far more successful venture than Bokeh and GMapPlot. It can plot the longitude-latitude co-ordinates we wanted, and its points stay consistently in place. In addition, Folium provides a 'pop-up text' box that gives additional information about a particular point.
In general, Folium was largely a successful visualisation. However, it still falls far short of Tableau's capability. For one, implementing a time slider would only be possible with heat maps (Folium's HeatMapWithTime plugin), which require far more precise co-ordinates than general country-level co-ordinates. Secondly, Python libraries simply lack the query capabilities that would let the user filter for a specific country and view data on only that country.
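A minimal sketch of the Folium pattern we used; the co-ordinates and counts below are placeholders rather than our actual data:

import folium

# Base map; the centre and zoom are arbitrary starting values
m = folium.Map(location=[30, 10], zoom_start=2)

# One circle per country of origin, with a pop-up giving extra information
origins = [('Syria', 34.8, 38.9, 1000), ('Eritrea', 15.2, 39.8, 500)]  # placeholder values
for name, lat, lon, refugees in origins:
    folium.CircleMarker(location=[lat, lon],
                        radius=max(3, refugees / 200),
                        popup='{}: {} refugees'.format(name, refugees),
                        fill=True).add_to(m)

m.save('uk_refugee_origins.html')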
Here is our Folium map visualisation:
Origin of refugees arriving in the UK, 2017
with Folium
Click on a circle for additional information
Understanding Refugee Movement
We created heatmaps to visualise when refugee traffic was at its busiest, based on the month of the year across the years.
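A sketch of the annotated heatmap pattern, using Plotly's figure factory with months on one axis and years on the other; the z values below are placeholders, not the UNHCR figures:

import plotly.figure_factory as ff
import plotly.offline as py

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
years = ['2015', '2016', '2017']

# Placeholder matrix: one row per year, one value per month
z = [[310, 280, 350, 400, 420, 450, 470, 500, 520, 480, 430, 390],
     [330, 300, 360, 410, 440, 470, 490, 530, 540, 500, 450, 400],
     [350, 320, 380, 430, 460, 480, 510, 550, 560, 520, 470, 420]]

fig = ff.create_annotated_heatmap(z=z, x=months, y=years, colorscale='Viridis')
py.plot(fig, filename='monthly_refugee_traffic.html')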