Comparative study of Architecture for Twitter Analysis and a proposal for an improved approach

B. Molnár, Z. Vincellér
Information Systems Department, Eötvös University of Budapest, Faculty of Informatics, Budapest, Hungary
{molnarba, vzoli}@inf.elte.hu

Abstract: A survey of the software and technology architectures of systems dedicated to the analysis of Online Social Networks (OSN) is presented. Based on the comparison and our own experiments, a novel approach is proposed. The steps of the experiments are described. The proposed solution is applied to Twitter and Facebook.

Keywords: On-line Social Network; Architecture; Software Architecture; Analysis; Data Mining

I. INTRODUCTION

Investigation of On-line Social Networks (OSN), or Social Network Services (SNS), from various, disparate viewpoints has become popular in academic circles. The reason is that OSN/SNS have emerged as an important medium for spreading information, supporting web content localization, disseminating opinions and exchanging information. Academic interest in OSN/SNS arose because several phenomena can be studied through social networking, so OSN/SNS can be used to examine issues of the social sciences. Moreover, since OSN/SNS are inherently IT and information systems, they provide an opportunity for experimenting with and developing algorithms, methods, models and architectures. So far, the architectures proposed for OSN/SNS search and analysis have not been studied in depth, although several papers describe the suggested analysis methods, algorithms and architectural solutions. Section II of this paper provides an overview of the relevant concepts of systems architecture. Section III presents the literature review of the system architectures developed in various approaches.
The description covers the business, information and data, and technical architecture of each system and, within the data and technical architecture tiers, the methods of data gathering and storage. Section IV presents the results of the comparison. Section V describes a software architecture that draws on our own experience, experimental designs and the conclusions of the assessment. Section VI summarizes the paper.

II. INFORMATION SYSTEM ARCHITECTURE

The Information System Architecture represents the structure of the building blocks of information processing systems, the relationships among their constituents, and the technological approaches and requirements of analysis systems whose main purpose is to support research in either social science or informatics. The Zachman ontology and the TOGAF approach ([10], [11]) can assist in understanding the various software architectures that have been proposed, and the relationships between the various perspectives, aspects, components and models. An Information Systems Architecture can be divided into several levels:

a) Business (systems) architecture: defines the structure and content (information and function) of all business systems in the organization. In a research environment, the process and tasks of research, and the business processes that may exploit the research results, can be placed into this architecture tier.
b) Information (or Data) Architecture: represents the main data types that support the business, and the structure (including interdependencies and relationships) of the information required and used by the organization. In our case, the data models and database schemas of a specific OSN system constitute the data architecture tier; the generalized and meta-data models, as well as the ontologies and meta-ontologies of the particular systems, can be considered a further extension and refinement of the data architecture towards an information architecture.
c) Application Architecture: defines the applications needed for data management and business support; it is the collection of relevant decisions about the organization (structure) of a software system, and the architectural style that guides this organization. In an OSN/SNS analysis system, the various algorithms and their software implementations, together with the services that generate the results, comprise the application tier.
d) Technical Architecture: represents the major technical and IT solutions used in application implementation and the infrastructure that provides an environment for information system deployment. Technical architecture describes and maintains the integrity of the hardware, software and infrastructure environment required to support the Business Systems Architecture and Information Systems Architecture tiers.

[Footnote: This work was partially supported by the European Union and the European Social Fund through project FuturICT.hu (grant no.: TAMOP-4.2.2.C-11/1/KONV-2012-0013). CogInfoCom 2013 • 4th IEEE International Conference on Cognitive Infocommunications • December 2–5, 2013, Budapest, Hungary. 978-1-4799-1546-0/13/$31.00 ©2013 IEEE]

e) Control Architecture: provides a time dimension on the impact of changes that manifest in data, applications and technology. The development and refinement of algorithms to support research makes it necessary to handle the related software versions and releases.
f) Developmental control: handles all the changes occurring in the process of new application development over time. Also called version control or software configuration management, developmental control records various aspects (who, what and when) of each change made.
g) Operational control: concerns the performance and integrity of the current data, applications and technology configuration.
h) Maintenance control: its main goal is to keep the system running and operating under changes in the volume of data and in the environment, despite modifications of the applied algorithm or data model, or alterations in the collected data and their accessibility from the particular OSN/SNS system.

This section reviews studies that reported on the system architectures used to investigate OSN/SNS systems alongside their main research subjects, such as the analysis of the contents of micro-blogs and posts, the discovery of the structure of relationships represented by connections, and sentiment analysis.

A software architecture for Twitter analysis is described in [2].

a) Business (systems) architecture: The goal is to automatically collect tweets within well-defined geographical areas, to provide the opportunity for analysis without real-time constraints, and to offer services that let end users look at the retrieved tweets.
b) Information (or Data) Architecture: A MySQL database is used for the relational data model. For text mining, the WordNet lexical database and a slang database are used. Geographical and spatial information is managed by the PostgreSQL database server and its PostGIS spatial extension, together with GeoDjango. The solution is a hybrid architecture that combines Lucene's advanced indexing schemes with the relational database approach. CSV (comma-separated values) files are used as well.
c) Application Architecture: The search engine provides basic retrieval of tweets by tweet text, tweeter's name, location, postcode, etc. Because Twitter uses the OAuth mechanism for authentication and identity management, the project used it as well. The model-template-view (MTV) software architectural pattern was applied, whereby the representation of domain-specific data and the business logic can be fully separated.
The Google Maps JavaScript API and a Django web application are used to display the retrieved tweets to end users. The data processing architecture explicitly separates the tasks of data collection and data analysis, thereby eliminating strict real-time performance and processing requirements.
d) Technical Architecture: The Python-based web framework Django ([20]) together with Apache Lucene ([18]) and the Streaming API ([19]); Scala as a functional programming environment; integration with Google Maps and its JavaScript API; the Apache Solr search engine (full-text search); and Haystack search for Django.
e) Developmental control: The selected architecture building blocks comply with open standards.
f) Operational control: To support operation, a diversity-based fault tolerance architecture ([21]) was selected.
g) Maintenance control: The selected architecture building blocks are open source systems.

A data analytics architecture for research on crisis informatics is described in [1]. It aims to help develop an adequate solution for monitoring social networks and for later analyzing the data created during crisis events.

a) Business (systems) architecture: The system was planned to assist research on how people seek and exchange information during crisis situations. The goal of the software architecture and engineering research is to connect sophisticated technical approaches to social science research, sociology and human-centered computing (HCC).
b) Information (or Data) Architecture: MySQL and Lucene were used for relational database and semi-structured document handling. Database transactions were configured with Spring. A Twitter user is mapped to an object, which is updated by the tweets belonging to that single user. The JPA 2.0 standard allows different relational databases to be used. Hibernate supports object-relational mapping. Hibernate Search and Lucene together provide an indexing mechanism to speed up object retrieval.
A persistence layer is implemented through JPA and a configuration file. There are persistence and domain layers.
c) Application Architecture: The system aimed at keyword search and at exploring the follower graph, i.e. the Twitter friends and their followers. REST-based web services allow both browser and programmatic access, offering an interface for desktop applications, command line tools and even iPad/iPhone applications. Spring helped to define the transaction "boundaries" and the scope of queries in terms of the social graph, i.e. to connect a collection of Twitter users into a community. The high-level services run concurrently in a high-performance environment while safeguarding database consistency.
d) Technical Architecture: The selected building blocks were open source frameworks of production-level quality: Spring and Spring MVC (Model-View-Controller), and Hibernate as the Java Persistence API (JPA) provider. Tomcat was used as the platform component of the architecture.
e) Developmental control: A hybrid transaction framework makes transactions easier to define than programmatic interfaces do.
f) Operational control: The hybrid transaction framework yields more control over transactions than the declarative approach, e.g. through transaction callbacks.
g) Maintenance control: Open source environments were selected as the basic architecture building blocks, together with the hybrid transaction framework, to help with the various maintenance tasks.

A Twitter analytics architecture is proposed in [12] for research that develops stochastic models of Twitter data. The goal of the research is to investigate the characteristics of message arrivals.
Because free data are scarce or unavailable, an appropriate architecture had to be defined that solves the data gathering problem, so that the descriptive and other statistics of the stochastic processes can be studied.

a) Business (systems) architecture: The research process to be supported is the creation of stochastic models of tweeting, i.e. understanding and modeling the time interval between the arrival of tweets and the frequency of re-tweeting.
b) Information (or Data) Architecture: MySQL database; the data gathering process makes use of the REST-based and Twython APIs; semi-structured data are kept in XML files. The MySQL database is used simultaneously by the PHP web services and the Python-based tweet gathering program.
c) Application Architecture: The system was planned to provide tweet collection and search functionalities. The Twython API call "ShowUser" is used to determine the location of tweets; the service gives the city and the state but not the geo-coordinates. Since this location information cannot be used for latitude/longitude analysis, a RESTful web service API provided by Yahoo is used. For statistical analysis, MATLAB is used, interfaced to Python through an API.
d) Technical Architecture: As the logical components of the tweet collection functionality, the proposed system used the Twitter APIs (Streaming API, Search API, REST and Twython APIs). The programming environment was based on Python and SQL; the logical platforms were PHP and Apache. The physical architecture component was a Dell PowerEdge server running Windows Server 2003.
e) Developmental control: The Python programming environment was exploited for quick modification and adjustment of the software. MySQL can easily interchange data with several languages of various development environments, e.g. Python. Python and MySQL can be installed on several different operating systems, such as the Linux and UNIX families.
f) Maintenance control: The logical architecture components are based on open standards and open source environments.

The TwitInfo system is dedicated to analyzing micro-blogs for event identification, visualization and summarization ([13]).

a) Business (systems) architecture: A "dashboard" is created to depict the summarization of events inferred by an event detection algorithm and from timeline textual information; the algorithm is intended to collect and analyze huge amounts of micro-blogs.
b) Information (or Data) Architecture: A database is used together with Django, which provides an appropriate indexing mechanism through the keywords and timestamps that are stored in and retrieved from the database.
c) Application Architecture: The system provides services to gather and discover tweets in real time, to carry out sentiment analysis, to connect tweets and events, and to extract metadata, e.g. URIs/URLs. The system logs the geocodes and geolocation tags or, where geo-tags are absent, attempts to translate the textual information into geographical information as latitude/longitude. The Google Visualization API is used to display information, which is placed on a map through the Google Maps API. TwitInfo used the Twitter API to find keywords for events.
d) Technical Architecture: The logical components are the Twitter API, the Google Visualization API, Google [...]

Tweetgeist and Statler are systems for text mining on short textual messages during live media events ([14], [15]).

a) Business (systems) architecture: Existing methods are applied to analyze micro-blogs and short messages linked to live media events, and hints about the collected content are then rendered for end users.
b) Application Architecture: The approach is twofold: one mode serves real-time usage, the other post-event watching of a previous media broadcast. In the first case, real-time feedback is given concurrently with the event, providing the opportunity to monitor the short messages created simultaneously. In the second case, the content of the messages is analyzed and used retrospectively, and the system gives hints about areas of interest to ease navigation for users.

The Eddi system analyzes micro-blogs through interactive browsing ([16]).

a) Business (systems) architecture: The process to be supported is that users want to see the relevant topics extracted from short messages or micro-blogs; they want to monitor significant but uncommon topics while leaving out the common topics that are not required. The main aim of the business process is to discover "eddies" within Twitter streams, a phenomenon analogous to eddies in a stream of water. A browser interface is offered to the public to look through tweets.
b) Information (or Data) Architecture: A search engine is perceived and utilized as a knowledge base. The short text messages are transformed to fit the concept of the search engine as a knowledge base. A dashboard view of topics is provided that displays the topic tag cloud.
c) Application Architecture: A browser interface and a tag cloud are created to describe the most important topics within Twitter streams. To generate the tag cloud, a topic clustering algorithm was designed to explore the short messages and to spot the relevant topics with the support of linguistic tools and search engines. The topic-centered browser categorizes the short messages to establish a consistent flow of dialogues.
Methods used: term frequency (TF), term frequency-inverse document frequency (TF-IDF), topic modeling, Latent Dirichlet Allocation (LDA) ([22]) and clustering. For clustering, the faceted browsing approach was selected, which demonstrated good performance compared with hierarchical browsing ([23]).
d) Technical Architecture: Search engine: Yahoo! Build Your Own Search Service (Y!BOSS).

III. COMPARATIVE STUDY OF SYSTEMS' ARCHITECTURES FOR ANALYZING OSN/SNS

The issues to be solved in gathering and analyzing the semi-structured information contained in OSN/SNS systems are as follows:

(1) Data and information must be extracted through the available services of the OSN/SNS. Data extraction is frequently constrained either by artificial rules or by restrictions of the IT environment in terms of time, volume, access rights, etc.
(2) If there is a business need for real-time evaluation, it imposes serious performance requirements when the aim is to collect every element of the information flow from an OSN/SNS system in order to monitor and promptly analyze it.
(3) The difficulties of text mining stem from the features of the textual information in OSN/SNS systems. There are profound differences among OSN/SNS systems, e.g. Twitter's rather restricted text length, or the wide use of slang and of words specific to social networks. Moreover, the interdependencies between messages should be taken into account if a semantically rich interpretation is required from the analysis, namely either the topic of the texts or a representation of the relationships among the messages or posts.
(4) The structure of links between posts on social sites differs greatly from the connections between traditional web sites. This leads to the issue of storing data in a distributed manner, and to questions of performance, efficiency, effectiveness and the cost of retrieving specific data under these circumstances.

IV. EXPERIMENTAL WORK: THE PROPOSED DATA COLLECTION ARCHITECTURE

OSN/SNS systems pose challenges from several viewpoints: software architecture, design, the handling of structured and semi-structured data or documents, and the management of large volumes of data.

Fig. 1. The basic data flow at the Technical Architecture tier of logical components: the sampling architecture

To understand the various proposed systems and their architectures, we selected the Zachman and TOGAF frameworks ([10], [11]) to juxtapose the different properties of the systems and to make them comparable and interpretable. After assessing the systems, we designed an architecture that draws on the results of the evaluation. The results of the comparative study can be summarized as follows. The majority of the proposed software architectures preferred open source systems. The primarily semi-structured data were handled on the one hand with relational technology, and on the other hand with key-value or No(t only)SQL technology for efficient and effective indexing. The physical hardware components were basically commercial off-the-shelf (COTS) products; therefore some of the proposed software architectures incorporated development environments such as Lucene and Hibernate to achieve better data handling performance. The systems and studies examined concentrated on content analysis and discussed the architecture only as a side effect. Papers dedicated to the analytics of the graph structure of SNSs touched only superficially on software architecture; therefore they were not included in the literature review and comparative study.

One of the specific features and areas of interest of Twitter is that its data are not only a set of data items and text elements: through the contained data items a complex network can be built up.
One structure is the follower graph (tweeters who follow specific other users); another, looser relationship is created by the mechanisms of re-tweets and hash-tags. These network structures can be mapped to graph structures, but discovering the basic organizing principles of the sub-graphs leads to challenging graph-theoretic questions and empirical investigations. The huge amount of semi-structured data from Twitter, Facebook, etc. requires approaches that can handle very large databases. Both storing and processing large volumes of semi-structured data make it necessary to exploit the services of the most modern database and programming technologies. In Social Network Analysis (SNA), the subject of research is the examination of relations, links and patterns of information exchange. A suitable graph representation can support finding specific behaviors and the construction of networks within social sites. Besides Twitter, Facebook has achieved enormous success; it has overtaken even Google in terms of visitors or users. In contrast to Facebook (and other OSNs, e.g. Myspace, Badoo, Orkut), Twitter does not set up barriers in the form of privacy control, so collecting relationship information is somewhat easier. Facebook restricts access rights, i.e. a valid user account is needed to enter the system through a browser or the API. A research question can be formulated: do the relationship graphs of two deeply differing SNSs, such as Twitter and Facebook, provide clues about the evolution of networks and about features of users' behavior that can be spotted in graph structures? A software architecture is built on our own previous experience and the results of other research (Fig. 1).

TABLE 1. ZACHMAN ARCHITECTURE'S RELATIONSHIP TO OSN/SNS ANALYSIS SYSTEMS

Columns (J. A. Zachman; S. H. Spewak): Entities = what (Data Architecture); Activities = how (Applications Architecture); Locations = where (Technology Architecture); People = who; Time = when; Motivation = why.

Planner: Objectives/Scope (Contextual); model: Scope
  What: List of Business Objects. Subject of research.
  How: List of Business Processes. The research process and the anticipated process as a result, in the form of a service for the public.
  Where: List of Business Locations.
  Who: List of Organizations important to the Business. The entities in society that play a role in the research.
  When: List of Events Significant to the Business. Major events that have some role in OSN/SNS and the research.
  Why: List of Business Goals/Strategies; Ends/Means. Major goals of research.

Owner: Enterprise Model (Conceptual); model: Enterprise Model
  What: Semantic Model. Object Class, Association, Ontology.
  How: Business Process Model. Web Services, Documents.
  Where: Business Logistics System. Acquiring method of data from OSN/SNS for research.
  Who: Work Flow Model. People: what contribution is provided by human resources to the research and how they can make use of the created system for their own purposes. Documents.
  When: Master Schedule. The real-world events generate occurrences in OSN/SNS in the form of the generated messages. The schedule of monitoring, tracking and data collecting.
  Why: Business Plan. The goal of research to be achieved.

Designer: Information Systems Model (Logical); model: System Model
  What: Logical Data Model. Structured and semi-structured model; Data Entity, Relationship.
  How: Application Architecture. Application Function, Web Services; I/O = User Views, Semi-structured documents.
  Where: System Geographic Deployment Architecture, e.g. Distributed System Architecture. Node = I/S Service (Processor, Storage, Logical Application Component, etc.); Link = Relationship between Logical Application Components.
  Who: Human Interface Architecture. People = Role; Work = Deliverable, Semi-structured documents.
  When: Processing Structure. Time = System Event, Orchestration; Cycle = Processing Cycle.
  Why: Business Rules. Ends = Structural Assertion; Means = Action Assertion.

Builder: Technology Model (Physical); model: Technical Model
  What: Physical Data Model. Ent. = Segment/Table/etc.; Reln = Pointer/Key/etc.
  How: System Design. Proc. = I/S Services; I/O = Data Elements/Sets, XML/HTML documents.
  Where: System Architecture/Technology Architecture. Physical Application Component; Node = Hardware/Systems Software; Link = Line Specifications.
  Who: Presentation Architecture. People = Screen Format, HTML/XML interface; Work = User.
  When: Control Structure. Time = Execute, Choreography; Cycle = Component Cycle.
  Why: Rule Design. Ends = Condition; Means = Action.

Subcontractor: Detailed Specifications (Out-of-context); model: Components
  What: Data Definition Repository. Ent. = Field; Reln = Address.
  How: Programs, Supporting Software Components. Proc. = Language Statement; I/O = Data Item, XML Field.
  Where: Network Architecture. Node = Address; Link = Protocol.
  Who: Security Architecture. People = Identity, Authentication, Authorization; Work = Job.
  When: Timing Definition. Time = Interrupt; Cycle = Machine Cycle.
  Why: Rule Specification. Ends = Sub-condition; Means = Step.

Functioning Enterprise
  What: Data. How: Function. Where: Network. Who: Organization. When: Schedule. Why: Strategy.

However, Twitter as a social network has a more complex construction: it has the relationship concepts of "following", "reply to" and "mention". This leads to a more complex graph representation that requires directed, differentiated edges between vertices, distinguished e.g. by color. For data gathering, both SNSs offer an API that conveys the semi-structured data from the networks. Instead of an open source relational database management system, we have chosen the Oracle RDBMS, as it is available for research. Streams from both SNSs are transformed into relational tables for further processing. Two data models and database schemas are defined, for Twitter messages and Facebook micro-blogs respectively.
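
The transformation of a semi-structured stream into relational tables with typed, directed edges can be illustrated with a minimal Python sketch. It is not the schema used in the experiments: the table and function names (`users`, `tweets`, `edges`, `load_tweet`, `add_following`, `edges_by_kind`) are hypothetical, the in-memory SQLite database merely stands in for the Oracle RDBMS, and the JSON field names (`user`, `in_reply_to_user_id`, `entities.user_mentions`) are assumed to follow the tweet objects of the Twitter REST API.

```python
import json
import sqlite3

# Hypothetical schema: one relation per entity plus one typed edge relation.
# The "kind" column plays the role of the edge color in the graph view.
SCHEMA = """
CREATE TABLE IF NOT EXISTS users  (user_id INTEGER PRIMARY KEY, screen_name TEXT);
CREATE TABLE IF NOT EXISTS tweets (tweet_id INTEGER PRIMARY KEY, user_id INTEGER, text TEXT);
CREATE TABLE IF NOT EXISTS edges  (src INTEGER, dst INTEGER, kind TEXT, tweet_id INTEGER);
"""


def load_tweet(conn, raw_json):
    """Map one tweet-like JSON document onto the three relations."""
    tweet = json.loads(raw_json)
    author = tweet["user"]
    conn.execute("INSERT OR IGNORE INTO users VALUES (?, ?)",
                 (author["id"], author["screen_name"]))
    conn.execute("INSERT OR IGNORE INTO tweets VALUES (?, ?, ?)",
                 (tweet["id"], author["id"], tweet["text"]))
    # A reply yields a directed "reply_to" edge from the author to the addressee.
    reply_to = tweet.get("in_reply_to_user_id")
    if reply_to is not None:
        conn.execute("INSERT INTO edges VALUES (?, ?, 'reply_to', ?)",
                     (author["id"], reply_to, tweet["id"]))
    # Every mentioned user yields a directed "mention" edge.
    for mention in tweet.get("entities", {}).get("user_mentions", []):
        conn.execute("INSERT INTO edges VALUES (?, ?, 'mention', ?)",
                     (author["id"], mention["id"], tweet["id"]))


def add_following(conn, follower_id, followed_id):
    """Record a 'following' edge; it is not tied to any single tweet."""
    conn.execute("INSERT INTO edges VALUES (?, ?, 'following', NULL)",
                 (follower_id, followed_id))


def edges_by_kind(conn):
    """Count the edges of each relationship type."""
    rows = conn.execute("SELECT kind, COUNT(*) FROM edges GROUP BY kind")
    return dict(rows.fetchall())


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    sample = json.dumps({
        "id": 100,
        "text": "@bob hello",
        "user": {"id": 1, "screen_name": "alice"},
        "in_reply_to_user_id": 2,
        "entities": {"user_mentions": [{"id": 2, "screen_name": "bob"}]},
    })
    load_tweet(conn, sample)
    add_following(conn, 1, 2)
    print(edges_by_kind(conn))
```

Keeping all three relationship types in a single edge relation with a `kind` column is only one possible design; separate relations per relationship type would serve equally well, and the single relation is chosen here because it keeps sub-graph queries per edge type to a simple filter.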
One reason to look for a proper software architecture is that the system should support research on the graph-theoretic properties of both SNSs when a COTS hardware environment is used. The emerging performance problems regarding time and storage can be kept under control using the Hadoop physical data structure architecture and the MapReduce massively parallel data processing approach. A relational data model of Twitter messages and Facebook micro-blogs is created by mapping the data, available as XML and JSON, into relation schemas. Because of the dissimilarities between the two networks, two disparate data models were created, although SNA requires a unified view of the two networks for studying the similarities and differences in terms of sub-graphs, the mutual mapping of entities playing roles in both networks, etc. Whereby, the [...]