Extracting Contiguous Twitter Social Media Data

Posted by: Informatica Enterprise Data Integration

PowerExchange for Twitter provides high-performance connectivity to the Twitter social network. This listing shows how to use PowerExchange for Twitter and PowerCenter to search, extract, and continuously accumulate tweets.

Overview

The Twitter Block file demonstrates how to search for tweets on a specific topic, determine the latest Tweet ID after each session, and run the session repeatedly to continue extracting tweets. The first time you run the session, the Integration Service extracts the tweets for the topic defined in the query string and stores them in the database. The Twitter Chain parameter file is then populated with the latest Tweet ID. If a scheduler is configured to run the session again, the next session uses the latest Tweet ID in the parameter file to extract the next set of tweets.

The demo file contains the following objects:

Twitter Mappings

The m_Twitter_chain mapping maps the Twitter source to a target database. The mapping extracts the latest 1500 tweets per session or the last six to seven days of historical tweets, whichever limit is reached first. If the session is configured to run again, it extracts the next set of matching tweets that were created after the previous session. The mapping also stores the latest Tweet ID, which is used to extract the next set of tweets.

Twitter Workflows

The wf_m_twitter_chain workflow contains the m_Twitter_chain mapping. The workflow is scheduled to run repeatedly at set intervals.

Twitter Chain Mapping

The m_twitter_chain mapping extracts contiguous tweets using the following two pipelines:

Twitter Entry
Twitter Chain
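
The repeat-run behavior described above can be illustrated with a small Python simulation. This is only a sketch under stated assumptions: run_session, the in-memory tweet list, and the dictionary fields are hypothetical stand-ins rather than PowerCenter or Twitter API objects, and the six-to-seven-day history window is not modeled.

```python
from typing import Dict, List, Tuple

# Per-session cap taken from the listing text.
MAX_TWEETS_PER_SESSION = 1500


def run_session(available: List[Dict], since_id: int,
                limit: int = MAX_TWEETS_PER_SESSION) -> Tuple[List[Dict], int]:
    """Simulate one session: return the extracted batch and the latest Tweet ID.

    Hypothetical helper: it mimics the Twitter Entry step (extract tweets newer
    than since_id) and the Twitter Chain step (determine the latest Tweet ID).
    """
    # Keep only tweets newer than the stored ID, newest first.
    newer = sorted((t for t in available if t["id"] > since_id),
                   key=lambda t: t["id"], reverse=True)
    batch = newer[:limit]
    # If nothing new was found, the stored ID stays unchanged.
    latest_id = batch[0]["id"] if batch else since_id
    return batch, latest_id


# First run: no stored ID yet, so start from 0.
stream = [{"id": i, "text": f"tweet {i} about twitter"} for i in range(1, 6)]
first_batch, first_max_id = run_session(stream, since_id=0)

# New tweets arrive before the scheduler starts the next run.
stream += [{"id": i, "text": f"tweet {i} about twitter"} for i in range(6, 9)]
second_batch, tweet_max_id = run_session(stream, since_id=first_max_id)
```

Each run starts where the previous one stopped, so the second batch contains only the tweets created after the first run.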

The following figure shows the m_Twitter_chain mapping:

The Twitter Entry Pipeline

The Twitter Entry pipeline is a pass-through pipeline that extracts tweets based on the search criteria. The Twitter Entry pipeline includes the following objects:

Twitter Entry source
Tweets Oracle target

The Twitter Chain Pipeline

The Twitter Chain pipeline contains transformations that determine the latest Tweet ID from the Tweets target database and store it in a parameter file. The Twitter Entry pipeline uses the parameter file details to extract the next set of contiguous tweets when the session repeats.

Twitter Chain Workflow

The wf_m_twitter_chain workflow contains the s_m_twitter_chain session and a Start task. The workflow uses a workflow variable, $$Tweet_MAX_ID, in the query string of the Application Source Qualifier to input the search criteria. The workflow variable is defined on the Variables tab of the workflow properties. The value of the variable is defined in the parameter file.

The Integration Service is configured to use the newline column delimiter in the Twitter_Chain_Params parameter file. Verify the parameter file name and directory in the workflow properties. Configure a scheduler to run the workflow at set intervals, each time extracting a contiguous set of tweets.

Twitter Connections

Configure the application connections in the Workflow Manager before you run the social media sessions. Verify the connection to the target database.

Search Criteria Configuration

When you configure a session for a Twitter source, you specify the query string that the Twitter API uses to search for the social media data. The query string is defined in the Application Source Qualifier for the Twitter source in the s_m_twitter_chain session. The query string has the format twitter since_id:$$Tweet_MAX_ID and contains the following parameters:

The default search topic "twitter" that generates a search result of all the tweets that contain the topic "twitter".
The Twitter Search API parameter, since_id, that returns results with a Tweet ID more recent than the specified ID.
The workflow variable, $$Tweet_MAX_ID, that takes the value as defined in the parameter file. It specifies the Tweet ID for the since_id parameter.
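
To make the substitution concrete, the following Python sketch shows how a value for $$Tweet_MAX_ID in a parameter file could end up in the query string. It assumes a parameter file laid out as a bracketed section heading followed by name=value lines; the folder name MyFolder, the sample Tweet ID, and all three helper functions are hypothetical, and the Twitter_Chain_Params file shipped with the listing may look different.

```python
def render_param_file(section: str, tweet_max_id: int) -> str:
    """Return parameter file text holding the latest Tweet ID (assumed layout)."""
    return f"[{section}]\n$$Tweet_MAX_ID={tweet_max_id}\n"


def read_tweet_max_id(param_text: str) -> int:
    """Read the $$Tweet_MAX_ID value back from parameter file text."""
    for line in param_text.splitlines():
        name, sep, value = line.strip().partition("=")
        if sep and name == "$$Tweet_MAX_ID":
            return int(value)
    raise ValueError("$$Tweet_MAX_ID not found in parameter file text")


def build_query(topic: str, tweet_max_id: int) -> str:
    """Expand the query string format: <topic> since_id:<Tweet ID>."""
    return f"{topic} since_id:{tweet_max_id}"


# Hypothetical folder name and Tweet ID, for illustration only.
param_text = render_param_file("MyFolder.WF:wf_m_twitter_chain", 123456789)
query = build_query("twitter", read_tweet_max_id(param_text))
```

Here query holds the expanded form of twitter since_id:$$Tweet_MAX_ID for the stored ID. In the actual workflow, the Integration Service performs this substitution when it reads the parameter file.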

You can download this listing as part of the Informatica for Social Media bundle.

Features

PowerExchange for Twitter 9.1.0 Hotfix1 and later.