Why ‘small’ still matters in big data

Posted in Projects, Publications - By Avijit Paul on Sunday, March 31st, 2013

Big data is big news in almost every sector, including crisis communication. However, apart from having limited access to big data, we often lack the tools needed to analyze and cross-reference large data sets for verification and validity before using them for crisis communication. We are therefore often left with the smaller datasets that we can gather from various sources. This does not appear to be a huge limitation: in crisis situations, the important data can still be found in datasets gathered with currently available tools.

We analyzed 164,390 tweets collected during the 2011 Christchurch earthquake using yourTwapperKeeper, to find out what type of location-specific information people mentioned in their tweets and when they talked about it. We found that the key areas can still be identified from small datasets. In the following sections I break this down into smaller segments.

Finding key areas

When we counted location (or named-entity) mentions in half-hour intervals, we found that larger or better-known areas were mentioned far more often than lesser-known ones. In the following image we can see that people talked about the larger area first, as they were just getting to know about the crisis and may not have been familiar with a smaller area such as Lyttelton.
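The half-hour counting described above can be sketched roughly as follows. This is a minimal illustration and not the tooling used in the study: the `(timestamp, text)` tweet format, the function name `mentions_per_bin`, the start time and the sample tweets are all assumptions made up for the example, and matching is a plain case-insensitive substring test, counted once per tweet.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta


def mentions_per_bin(tweets, keywords, start, bin_minutes=30):
    """Count how many tweets mention each keyword in each time bin.

    tweets: iterable of (datetime, text) pairs (hypothetical format).
    Returns {bin_index: Counter({keyword: tweet_count})}.
    """
    counts = defaultdict(Counter)
    bin_size = timedelta(minutes=bin_minutes)
    for created_at, text in tweets:
        if created_at < start:
            continue  # ignore anything before the chosen start time
        bin_index = (created_at - start) // bin_size  # 0 = first half hour
        lowered = text.lower()
        for keyword in keywords:
            if keyword.lower() in lowered:
                counts[bin_index][keyword] += 1
    return counts


# Invented sample data, for illustration only.
start = datetime(2011, 2, 22, 12, 51)
sample = [
    (start + timedelta(minutes=5), "Huge quake in Christchurch"),
    (start + timedelta(minutes=20), "Christchurch cathedral damaged"),
    (start + timedelta(minutes=50), "Roads closed near Lyttelton"),
]
result = mentions_per_bin(sample, ["Christchurch", "Lyttelton", "Cathedral"], start)
for bin_index in sorted(result):
    print(bin_index, dict(result[bin_index]))
```

Substring matching is crude (it would also count a keyword embedded in a longer word), so a real analysis would likely need tokenization or a named-entity recognizer.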

Using key areas to narrow down specific areas

Nonetheless, finding the large area is useful for primary identification. Once the larger area is located, one can establish that a specific place mentioned in the tweets, which may share its name with a place elsewhere, falls inside the disaster area and not in another city or country. For example, the keyword “Cathedral” was mentioned 801 times in the first 6 hours. Although New Zealand has several other places with “Cathedral” in their names, such as Cathedral Place in Auckland, Cathedral Court in Hahei and Cathedral Cove in Waikato, the mentions referred to Cathedral Square in Christchurch. Therefore, by using the frequently mentioned area as a filter, we can pinpoint conversations related to smaller areas inside the crisis area.
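One simple way to picture this filtering step is to keep only the tweets that mention the ambiguous place name together with the frequently mentioned larger area. The sketch below assumes tweets are plain strings; the function name `filter_by_context` and the sample tweets are invented for illustration and do not come from the dataset.

```python
def filter_by_context(texts, target, context_terms):
    """Keep texts that mention `target` and at least one context term.

    Matching is a case-insensitive substring test.
    """
    target = target.lower()
    context = [term.lower() for term in context_terms]
    kept = []
    for text in texts:
        lowered = text.lower()
        if target in lowered and any(term in lowered for term in context):
            kept.append(text)
    return kept


# Invented sample tweets, for illustration only.
tweets = [
    "The Cathedral in Christchurch is badly damaged",
    "Lovely walk at Cathedral Cove today",
    "Christchurch airport reported open",
]
kept = filter_by_context(tweets, "Cathedral", ["Christchurch"])
print(kept)
```

A filter like this trades recall for precision: tweets that mention the Cathedral without naming the city are dropped, so in a dataset already collected around the crisis hashtag a looser rule may be more appropriate.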

Narrowing to important locations

Once we have identified the area, it is time to look for mentions of smaller places: for the data to be useful for disaster recovery, we may need location entries as specific as a certain road, hospital or airport. We can see that the Cathedral was mentioned the most at the beginning, and it was one of the first places to be affected. The word “hospital” was mentioned heavily between the 2 and 2.5 hour marks. There may be two reasons for such mentions. One is that people were looking for a hospital to go to, or passing on word that a certain hospital was open, closed or over capacity. The other is that a hospital was itself hit around that time. In the case of this earthquake, Christchurch Hospital was being partly evacuated at that hour due to damage in some areas.

We also see mentions of the airport in various tweets at different times. Reading the tweets, we find that most of these mentions arose because people first received unconfirmed news that the airport had also been hit, and later learned that it was not damaged and could be used.

Narrowing to exact places

If we now set aside the top two mentioned areas among the specific areas, we find that the CTV (Canterbury Television) building was mentioned heavily from 18 hours after the earthquake. It also had around 200 mentions in the first 6 hours. Since 94 of the 168 casualties recorded in the Christchurch earthquake were from this building, let us focus on it a little more. From the collected tweets we can see that in the first half hour there is only one mention of the building. The number of mentions did not increase noticeably over the next 2 hours. However, from the third hour onward, more and more information about the CTV building starts to appear in the tweet stream.

Based on the limited data collected, we can see that it is possible to identify at least the bigger areas very quickly, as they are mentioned very frequently. For example, the Cathedral is at the heart of Christchurch, so it was extremely well known to people in the area and was heavily mentioned. Furthermore, early footage (images and videos) also showed the destroyed Cathedral and was then retweeted many times.

However, the more specific or smaller areas are mentioned less frequently. Even so, if one observes repeated mentions of a certain location or specific area, one can infer that it is potentially a disaster-stricken area, which was the case for the CTV building.
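The idea of watching for repeated mentions of a small location can be illustrated with a simple rule: flag the first time a location's per-interval count stays at or above some threshold for several consecutive intervals. This is only a sketch of one possible rule, not a method used in the analysis; the function name, the threshold, the run length and the counts below are all invented.

```python
def first_sustained_bin(counts_by_bin, threshold, run_length=2):
    """Return the index of the first bin that starts a run of `run_length`
    consecutive bins with counts >= threshold, or None if there is none."""
    run = 0
    for index, count in enumerate(counts_by_bin):
        if count >= threshold:
            run += 1
            if run == run_length:
                return index - run_length + 1
        else:
            run = 0  # the run is broken, start counting again
    return None


# Invented half-hourly counts for one location, for illustration only.
half_hourly_counts = [1, 0, 0, 0, 0, 0, 6, 9, 12]
flagged = first_sustained_bin(half_hourly_counts, threshold=5)
print(flagged)
```

Requiring a run of intervals, and not a single spike, reduces the chance of flagging a place because of one heavily retweeted message, at the cost of a later alert.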

Discussion

Based on the analysis, we can suggest that, in the absence of access to large data sets, if we are only looking for location information to find out which areas require more help, we can still find the names of the places that were hit hard during a disaster. Although the small dataset we have at CCI was set up after the hashtag became popular, and therefore missed a certain amount of information, it still appears to be quite useful for identifying locations from tweets gathered using existing methods.

Further research is needed to identify whether other keywords indicate location information, such as “at”, “in” or other prepositions, since a list of location names will not be available while the disaster is in progress. By using various other combinations, it is potentially possible to find a mention of a location even if it has not been reported in other media.
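As a rough illustration of the preposition idea, the sketch below pulls out runs of capitalized words that follow “at” or “in”. It is an untested heuristic offered as a starting point, not a result of this study: the function name `candidate_locations`, the regular expression and the sample tweet are all assumptions made for the example.

```python
import re


def candidate_locations(text, prepositions=("at", "in")):
    """Return capitalized phrases that directly follow a preposition.

    For example, "at Cathedral Square" yields "Cathedral Square".
    """
    alternatives = "|".join(re.escape(p) for p in prepositions)
    pattern = re.compile(
        r"\b(?:%s)\s+([A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*)" % alternatives
    )
    return pattern.findall(text)


# Invented sample tweet, for illustration only.
tweet = "Reports of damage in CTV Building, people gathering at Cathedral Square"
found = candidate_locations(tweet)
print(found)
```

A heuristic like this will also pick up capitalized non-places (for example “in February”) and will miss lowercase or misspelled place names, so its output would need to be checked against mention frequency or other sources.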

Note: A version of this article was first published at the International Conference on e-Education, e-Business and Information Management (ICEEIM 2013), Beijing, China, March 14-15, 2013.