Web Mining
Introduction
What is Web Mining?
Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services. Web mining deals with three main areas: web content mining, web usage mining and web structure mining. In web usage mining it is desirable to find the habits of a website’s users and the relations between the pages they are looking for. To identify the actual users, some filtering has to be done to remove bots that index the structure of a website.
Robots visit all pages and links on a website to find relevant content. This generates many requests to the web server and thereby gives a false image of the actual web usage.
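As a rough illustration of the bot filtering mentioned above, the sketch below separates likely robot requests from human ones by inspecting the User-Agent string. The log records, field names and marker substrings are invented for the example; real filters also use request rates, robots.txt accesses and known-crawler IP lists.

```python
# Minimal sketch: separate likely robot requests from human requests
# in a web server log, based on the User-Agent string.
# The marker substrings below are illustrative, not exhaustive.

BOT_MARKERS = ("bot", "crawler", "spider", "slurp")

def is_robot(user_agent):
    """Heuristic: flag a request as a robot if its User-Agent
    contains a known crawler marker (case-insensitive)."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)

def filter_human_requests(log_entries):
    """Keep only entries whose User-Agent does not look like a robot."""
    return [e for e in log_entries if not is_robot(e["user_agent"])]

# Invented example log entries:
log = [
    {"path": "/index.html", "user_agent": "Mozilla/5.0 (Windows NT 10.0)"},
    {"path": "/products",   "user_agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"},
    {"path": "/cart",       "user_agent": "Mozilla/5.0 (X11; Linux x86_64)"},
]
humans = filter_human_requests(log)
```

Only after this kind of cleaning do the remaining requests reflect actual user behaviour rather than crawler traffic.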
The paper we have chosen to start with [Tang et al. 2002] does not discuss web content and web structure mining in depth, but instead looks closer at web usage mining. This field describes relations between web pages based on the interests of users, i.e. finding links that are often clicked in a specific order and are therefore of greater relevance to the user. The patterns revealed can then be used to create a more visitor-customized website by highlighting or otherwise exposing web pages to increase commerce. This is often demonstrated as a price cut on one product which will increase sales of another. On the other hand, it is also important not to misclassify actual users who make thorough searches of websites and label them as robots.
1. Motivation / Opportunity
The WWW is a huge, widely distributed, global information service centre and therefore constitutes a rich source for data mining.
01. Personalization, recommendation engines.
02. Web-commerce applications.
03. Building the Semantic Web.
04. Intelligent Web search.
05. Hypertext classification and categorization.
06. Information / trend monitoring.
07. Analysis of online communities.
2. The Web
01. Over 1 billion HTML pages, 15 terabytes
02. Wealth of information
a. Bookstores, restaurants, travel, malls, dictionaries, news, stock quotes, yellow & white pages, maps, markets, .........
b. Diverse media types: text, images, audio, video
c. Heterogeneous formats: HTML, XML, postscript, pdf, JPEG, MPEG, MP3
03. Highly Dynamic
a. 1 million new pages each day
b. Average page changes in a few weeks
04. Graph structure with links between pages
a. Average page has 7-10 links
b. in-links and out-links follow power-law distribution
05. Hundreds of millions of queries per day
3. Abundance and authority crisis.
01. Liberal and informal culture of content generation and dissemination
02. Redundancy and non-standard form and content
03. Millions of qualifying pages for most broad queries
Example: java or kayaking
04. No authoritative information about the reliability of a site
05. Little support for adapting to the background of specific users
4. One Interesting Approach
01. The number of web servers was estimated by sampling and testing random IP address numbers and determining the fraction of such tests that successfully located a web server
02. The estimate of the average number of pages per server was obtained by crawling a sample of the servers identified in the first experiment
03. Lawrence, S. and Giles, C. L. (1999). Accessibility of information on the web. Nature, 400(6740): 107–109.
5. Applications of web mining
a. E-commerce (Infrastructure)
b. Generate user profiles -> improve customization and provide users with pages and advertisements of interest
c. Targeted advertising -> Ads are a major source of revenue for Web portals
(e.g., Yahoo, Lycos) and E-commerce sites.
Internet advertising is probably the
“hottest” web mining application today
d. Fraud -> Maintain a signature for each user based on buying patterns on the
Web (e.g., amount spent, categories of items bought). If buying pattern changes
significantly, then signal fraud
e. Network Management
f. Performance management -> Annual bandwidth demand is increasing ten-fold
on average, while annual bandwidth supply is rising only by a factor of three. The
result is frequent congestion. During a major event (e.g. the World Cup), an
overwhelming number of user requests can result in millions of redundant copies
of data flowing back and forth across the world
g. Fault management -> analyze alarm and traffic data to carry out root cause
analysis of faults
h. Information retrieval (Search) on the Web
i. Automated generation of topic hierarchies
j. Web knowledge bases
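The fraud-detection idea in item d above can be sketched as follows: keep a running statistical signature of each user's purchase amounts and flag a purchase that deviates strongly from it. The class name, z-score threshold and minimum-history rule are illustrative assumptions, not taken from the text; real systems track many more features than amount alone.

```python
# Hypothetical sketch of a per-user buying-pattern "signature":
# track purchase amounts and flag purchases far from the user's norm.
import statistics

class SpendingSignature:
    def __init__(self):
        self.amounts = []

    def record(self, amount):
        self.amounts.append(amount)

    def is_suspicious(self, amount, z_threshold=3.0):
        """Flag amounts more than z_threshold standard deviations
        from the user's historical mean (needs some history first)."""
        if len(self.amounts) < 5:
            return False  # too little history to judge
        mean = statistics.mean(self.amounts)
        stdev = statistics.pstdev(self.amounts)
        if stdev == 0:
            return amount != mean
        return abs(amount - mean) / stdev > z_threshold

# Invented purchase history: typical spending around 20-30.
sig = SpendingSignature()
for amount in [20, 25, 18, 22, 30]:
    sig.record(amount)
```

A purchase of 500 would then signal possible fraud, while one of 24 would not.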
6. Why is Web Information Retrieval Important?
a. According to most predictions, the majority of human information
will be available on the Web in ten years
b. Effective information retrieval can aid in
c. Research: Find all papers about web mining
d. Health/Medicine: What could be the reason for symptoms of “yellow
eyes”, high fever and frequent vomiting?
e. Travel: Find information on the tropical island of St. Lucia
f. Business: Find companies that manufacture digital signal processors
g. Entertainment: Find all movies starring Marilyn Monroe between the
years 1960 and 1970
h. Arts: Find all short stories written by Jhumpa Lahiri
7. Why is Web Information Retrieval Difficult?
a. The Abundance Problem (99% of information of no interest to 99%
of people)
b. Hundreds of irrelevant documents returned in response to a search
query
c. Limited Coverage of the Web (Internet sources hidden behind
search interfaces)
d. Largest crawlers cover less than 18% of Web pages
e. The Web is extremely dynamic
f. Lots of pages added, removed and changed every day
g. Very high dimensionality (thousands of dimensions)
h. Limited query interface based on keyword-oriented search
i. Limited customization to individual users
8. Web Mining Taxonomy
01. Web content mining: focuses on techniques for assisting a user in finding documents that meet a certain criterion (text mining)
02. Web structure mining: aims at developing techniques to take advantage of the collective judgement of web page quality which is available in the form of hyperlinks
03. Web usage mining: focuses on techniques to study the user behaviour when navigating the web (also known as Web log mining and clickstream analysis)
9. Web Content Mining
01. Can be thought of as extending the work performed by basic search engines
02. Search engines have crawlers to search the web and gather information, indexing techniques to store the information, and query processing support to provide information to the users
03. Web Content Mining is: the process of extracting knowledge from web contents
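The crawl/index/query pipeline described above rests on the inverted index: a mapping from each term to the set of documents containing it. A toy version, assuming plain whitespace tokenization (real engines add stemming, ranking and index compression):

```python
# Toy inverted index and conjunctive keyword query, illustrating the
# indexing and query-processing steps performed by search engines.
from collections import defaultdict

def build_index(docs):
    """docs: dict mapping doc id -> text.
    Returns term -> set of doc ids containing that term."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Invented mini-corpus:
docs = {
    1: "web mining extracts knowledge from web content",
    2: "data mining techniques for structured data",
    3: "usage mining studies web navigation behaviour",
}
index = build_index(docs)
```

Querying `search(index, "web mining")` intersects the posting sets of the two terms, which is exactly the document-retrieval step that content mining then builds upon.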
10. Structuring Textual Information
01. Many methods designed to analyze structured data
02. If we can represent documents by a set of attributes we will be able to use existing data mining methods
03. How to represent a document?
04. Vector based representation
(referred to as “bag of words” as it is invariant to permutations)
05. Use statistics to add a numerical dimension to unstructured text.
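A minimal sketch of the bag-of-words construction just described: each document becomes a vector of term counts over a shared vocabulary, so permuting the words of a document leaves its vector unchanged. The helper name and toy documents are our own.

```python
# Bag-of-words: represent documents as term-count vectors so that
# existing data mining methods for structured data can be applied.
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, vectors): vocabulary is a sorted term list,
    vectors[i][j] is the count of vocabulary[j] in docs[i]."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors

# Two invented one-line "documents":
docs = ["web mining mining", "mining web"]
vocab, vectors = bag_of_words(docs)
```

Note the permutation invariance: "mining web" and "web mining" map to the same vector, which is precisely why the representation is called a bag (not a sequence) of words.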
11. Text Mining
01. Document classification
02. Document clustering
03. Key-word based association rules
12. Web Search
01. Domain-specific search engines
a. www.buildingonline.com
b. www.lawcrawler.com
c. www.drkoop.com (medical)
02. Meta-searching
a. Connects to multiple search engines and combines the search results
b. www.metacrawler.com
c. www.dogpile.com
d. www.37.com
03. Post-retrieval analysis and visualization
a. www.vivisimo.com
b. www.tumba.pt
c. www.kartoo.com
04. Natural language processing
a. www.askjeeves.com
05. Search agents
a. Instead of storing a search index, search agents perform real-time searches on the Web.
b. Fresher data, but slower response time and lower coverage.
13. Web Structure Mining: First generation of search engines
01. Early days: keyword based searches
a. Keywords: “web mining”
b. Retrieves documents containing “web” and “mining”
02. Later on: cope with
a. synonymy problem
b. polysemy problem
c. stop words
03. Common characteristic: Only information on the pages is used.
14. Modern search engines
01. Link structure is very important
a. Adding a link is a deliberate act
b. It is harder to fool systems using in-links
c. A link is a “quality mark”
02. Modern search engines use link structure as important source of information.
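One widely known way to use link structure as a quality mark is PageRank-style scoring: a page is important if important pages link to it. The sketch below runs plain power iteration on a tiny invented link graph; the damping factor 0.85 is the commonly cited default, and the graph itself is illustrative.

```python
# PageRank sketch via power iteration: scores flow along out-links,
# so pages receiving links from well-scored pages score highly.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict page -> list of pages it links to.
    Returns dict page -> score (scores sum to ~1)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:  # dangling page: spread its mass evenly
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Invented three-page web: C is linked to by both A and B.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

Because C collects in-links from both A and B, it ends up with the highest score, illustrating why in-links are harder to fool than on-page keywords.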
15. The Web Structure
a. If the web is treated as an undirected graph, 90% of the pages form a single connected component
b. If the web is treated as a directed graph, four distinct components are identified, all four of similar size.
16. Some statistics
01. Only for about 25% of page pairs does a connecting path exist, BUT
02. If there is a path:
a. Directed: average length < 17
b. Undirected: average length ≈ 7
03. It is a small world (between two people there is only a chain of length 6!)
a. High number of relatively small cliques
b. Small diameter
04. The Internet (SCC) is a small world graph.
17. Applications Web mining is an important tool to gather knowledge of the behaviour of Websites’ visitors and thereby to allow for appropriate adjustments and decisions with respect to Websites’ actual users and traffic patterns. Along with a description of the processes involved in Web mining, [Srivastava, 1999] states that Website Modification, System Improvement, Web Personalization and Business Intelligence are four major application areas for Web mining. These are briefly described in the following sections.
18. Website Modification The content and structure of a Website are important to the user’s experience/impression of the site and the site’s usability. The problem is that different types of users have different preferences, backgrounds, knowledge etc., making it difficult (if not impossible) to find a design that is optimal for all users. Web usage mining can then be used to detect which types of users are accessing the website and how they behave, knowledge which can then be used to manually design/re-design the website, or to automatically change its structure and content based on the profile of the user visiting it. Adaptive Websites are described in more detail in [Perkowitz & Etzioni, 1998].
19. System Improvement The performance and service of Websites can be improved using knowledge of the Web traffic in order to predict the navigation path of the current user. This may be used e.g. for caching, load balancing or data distribution to improve performance. The path prediction can also be used to detect fraud, break-ins, intrusion etc. [Srivastava, 1999].
20. Web Personalization Web Personalization is an attractive application area for Web based companies, allowing recommendations, marketing campaigns etc. to be specifically customized for different categories of users, and more importantly to do this in real-time, automatically, as the user accesses the Website. For example, [Mobasher et al. 1999] and [Yan et al. 1996] use association rules and clustering to group users and to discover the type of user currently accessing the Website (based on the user’s path through the Website), in real-time, to dynamically adapt hyperlinks and content of the Website.
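The association-rule idea referenced above can be illustrated with a frequent-pairs count over user sessions: pages that co-occur in many sessions become candidates for "users who viewed X also viewed Y" recommendations. The session data, function name and support threshold are invented for the example; real systems use full association-rule miners such as Apriori.

```python
# Count page pairs that co-occur in user sessions and keep those
# above a minimum support -- the simplest form of association mining
# over clickstream data.
from itertools import combinations
from collections import Counter

def frequent_pairs(sessions, min_support=2):
    """sessions: list of sets of pages viewed in one visit.
    Returns pairs of pages co-occurring in >= min_support sessions."""
    counts = Counter()
    for session in sessions:
        for pair in combinations(sorted(session), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Invented sessions extracted from a (bot-filtered) web log:
sessions = [
    {"/home", "/products", "/cart"},
    {"/home", "/products"},
    {"/home", "/about"},
]
pairs = frequent_pairs(sessions)
```

Here only the pair (/home, /products) reaches the support threshold, so a personalization engine might promote a link to /products on the home page.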
21. Business Intelligence For Web based companies Web mining is a powerful tool for collecting business intelligence to gain competitive advantages. Patterns of the customers’ activities on the Website can be used as important knowledge in the decision-making process, e.g. for predicting customers’ future behaviour, recruiting new customers and developing new products. There are many companies providing (among other things) services in the field of Web mining and Web traffic analysis for extracting business intelligence, e.g. [BizInetl, 2011] and [WebTrends, 2011].
22. Summary
01. Web is huge and dynamic
02. Web mining makes use of data mining techniques to automatically discover and extract information from Web documents/services
03. Web content mining
04. Web structure mining
05. Web usage mining
06. Semantic web: "The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." -- Tim Berners-Lee, James Hendler, Ora Lassila.
Thanks for reading my article. Any questions? Comment below.