Web Science

Option C in the IB Computer Science course.

Creating the Web

The Internet, an internet, and the World Wide Web (otherwise referred to as the web) are commonly mixed up. However, each is quite different.

C.1.1 Distinguish between the internet and World Wide Web (web).

An internet simply refers to a set of interconnected networks. The Internet refers to the global computing network that uses standardized communications protocols, including IP addressing. In other words, the Internet is a wide-area network that spans the planet.^[1] The World Wide Web (Web) is the information space made up of various web resources that can be accessed via the Internet. In other words, the World Wide Web is a service that runs on the Internet.

The analogy can be made that the Internet is a restaurant and the web is its most popular dish.

Growth of the Web

C.1.2 Describe how the web is constantly evolving.

Broadly, the web has moved from personal sites to blogs, or from publication to participation. It was a move from static pages to dynamic ones.

Early Forms of the Web

Sometimes referred to as "Web 1.0", the early stages of the web consisted of personal, static web pages hosted on ISP (internet service provider) web servers or on free web hosting services. This was generally before the widespread use of dynamic programming languages such as Perl, PHP, and Python. Typical design elements included online guestbooks instead of comment sections, and HTML forms that were sent via mailto.

Web 1.0 is associated with the business model of Netscape - focusing on software creation, updates, and bug fixes and the distribution of such to end users.

Web 2.0

Web 2.0 refers to a web that emphasizes user participation and contribution, in sites such as social media sites and blogs. It features client-side technologies such as Ajax and JavaScript as well as dynamic programming languages. The focus on user interface, application software, and storage of files has been referred to as "network as a platform". Key features of Web 2.0 include:

Folksonomy - free classification of information (such as in tagging)
User Participation - site users are encouraged to add value/content to the site
Mass Participation - universal web access has led to the differentiation of concerns from the user base
SaaS (Software-as-a-Service)

In contrast to Web 1.0, Web 2.0 is associated with Google, which focused not on creating end-user software but providing a service based on existing data.

The Semantic Web

The Semantic Web is an extension of the web through standards set by the World Wide Web Consortium (W3C) that promote common data formats and exchange protocols. For example, the Resource Description Framework (RDF) specification is promoted as a general method for conceptual modelling of web resources using subject-predicate-object expressions (e.g. subject: "the table", predicate: "has the length of", object: "one meter").
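The subject-predicate-object idea can be sketched with plain Python tuples. This is only an illustration and does not use a real RDF library; the names Triple and objects_of, and the second statement about the table, are assumptions made for the example.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


triples = [
    Triple("the table", "has the length of", "one meter"),
    Triple("the table", "is made of", "wood"),  # hypothetical extra statement
]


def objects_of(subject: str, predicate: str) -> list[str]:
    """Return every object stated for the given subject and predicate."""
    return [t.obj for t in triples
            if t.subject == subject and t.predicate == predicate]


print(objects_of("the table", "has the length of"))
```

Storing statements in this uniform shape is what lets a program answer questions about resources without knowing in advance what kinds of facts exist.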

Protocol and Addressing

C.1.9 Explain the importance of protocols and standards on the web.

Protocols are sets of rules for communication that ensure proper, compatible communication so that a given process can take place successfully, e.g. TCP/IP. Protocols ensure the universality of the web. Standards, on the other hand, are sets of technical specifications that should be followed to allow for functionality, but that do not necessarily have to be followed for a process to succeed, e.g. HTML. Without them, it would be like trying to communicate in a foreign language without knowing that language.

e.g. without TCP, there would be no transport protocol and packets would be lost.

e.g. without HTML, there would be no standard markup language for displaying webpages and different web browsers may not display all pages.^[2]

Web Browser

C.1.12 Explain the functions of a browser.

A web browser is a software tool for retrieving, presenting, and traversing information resources on the web.

C.1.7 Identify the characteristics of: IP, TCP, and FTP.

TCP and IP together comprise a suite of protocols that carry out the basic functionality of the web.

Internet Protocol (IP)

IP is a network protocol that defines how data packets are addressed and routed.^[1] Every device on the network is identified by an IP address, and IP is responsible for getting the data to its destination.

Transmission Control Protocol (TCP)

Information sent over the internet is broken into “packets” and sent through different routes to reach a destination. TCP creates the data packets, puts them back together in the correct order, and checks that no packets were lost.
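The split, reorder, and check steps can be sketched as follows. This is a simplified illustration rather than real TCP, which numbers bytes and uses acknowledgements and retransmission; the function names and the packet size are assumptions.

```python
import random


def to_packets(message: str, size: int) -> list[tuple[int, str]]:
    """Split a message into (sequence number, payload) pairs."""
    return [(seq, message[i:i + size])
            for seq, i in enumerate(range(0, len(message), size))]


def reassemble(packets: list[tuple[int, str]], expected: int) -> str:
    """Sort by sequence number and check that nothing is missing."""
    ordered = sorted(packets)
    received = [seq for seq, _ in ordered]
    if received != list(range(expected)):
        raise ValueError("missing or duplicated packets")
    return "".join(payload for _, payload in ordered)


packets = to_packets("Hello, Web Science!", 4)
total = len(packets)
random.shuffle(packets)  # packets may arrive out of order
print(reassemble(packets, total))
```

The sequence numbers are what make reordering and loss detection possible: without them the receiver could not tell which piece belongs where.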

File Transfer Protocol (FTP)

FTP is the protocol that provides methods for sharing or copying files over a network. It is primarily used for uploading files to a web site, and some download sites may use an FTP server. However, HTTP is more common for downloading. When FTP is used, the URL begins with ftp:.

C.1.3 Identify the characteristics of: HTTP, HTTPS, and URL.

Hypertext Transfer Protocol (HTTP)

HTTP is the protocol used to communicate between web servers and web browsers. It is text-based and stateless: each user request is handled on its own (in early versions, over a new connection per request), and it works without knowledge of the underlying communications network.

Hypertext Transfer Protocol Secure (HTTPS)

As HTTP does not provide much security, HTTPS was developed, adding encryption to the connection using Transport Layer Security (TLS) or its predecessor, Secure Sockets Layer (SSL).

Uniform Resource Locator (URL)

A standard way of specifying the location of a webpage.^[1]

Uniform Resource Identifier (URI)

A means of identifying a specific resource, such as a webpage on a website.

C.1.4 Identify the characteristics of: uniform resource identifier (URI) and URL.

URLs have typical characteristics, which are shared with URIs.

For example, the URL http://example.com/page/resource has a protocol identifier for retrieval (http), a resource name that gives the address (example.com), and a specific file name.

C.1.5 Describe the purpose of a URL.

A URI is a string that identifies a resource. A URL is a specific type of URI that provides the address of a web resource as well as the means to retrieve it. For example, http://example.com/index identifies the http protocol for retrieval, example.com as the address, and /index as the specific file.
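The same breakdown can be reproduced with Python's standard urllib.parse module, shown here as a small sketch on the example URL from the text.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://example.com/index")
print(parts.scheme)  # protocol used for retrieval
print(parts.netloc)  # address (host) of the resource
print(parts.path)    # specific file or path on that host
```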

Domain Name Server (DNS)

C.1.6 Describe how a domain name server functions.

A Domain Name Server is a special type of server that relates a web address to an IP address, acting somewhat like a directory. It uses a hierarchical, decentralized naming system: lookups go from the root DNS servers to the top-level domain servers (such as .net and .com), and then to the authoritative DNS servers below each top level (for example, stanford may sit under .edu).
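The hierarchical lookup can be sketched as a toy model with nested dictionaries. It is not how a real resolver is implemented: every name, address, and function below is made up for illustration, and the addresses come from a range reserved for documentation.

```python
# Toy name hierarchy: root -> top-level domain -> authoritative records.
# All names and addresses are hypothetical.
ROOT = {
    "edu": {"stanford.edu": "192.0.2.10"},
    "com": {"example.com": "192.0.2.20"},
}


def resolve(hostname: str) -> str:
    """Walk from the root to the TLD level, then to the authoritative record."""
    tld = hostname.rsplit(".", 1)[-1]
    tld_server = ROOT.get(tld)
    if tld_server is None:
        raise LookupError(f"unknown top-level domain: {tld}")
    try:
        return tld_server[hostname]
    except KeyError:
        raise LookupError(f"no record for {hostname}") from None


print(resolve("stanford.edu"))
```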

Mark-up and Style Sheets

C.1.3 Identify the characteristics of the following: HTML, XML, XSLT, Javascript, and CSS.

Hypertext Mark-up Language (HTML)

HTML is the standard markup language used to make web pages. Characteristics:

Allows for embedded images/objects or scripts
HTML predefined tags structure the document
Tags are marked-up text strings, elements are “complete” tags with an opening and a closing, and attributes modify the values of an element
Typically paired with CSS for style

Cascading Style Sheet (CSS)

A CSS style sheet describes how HTML elements are displayed. It can control the layout of several web pages at once.

Extensible Mark-Up Language (XML)

XML is a markup specification language that defines rules for encoding documents (to store and transport data) in a format that is both human- and machine-readable. XML, as a metalanguage, supports the creation of custom tags (unlike HTML), which can be defined using Document Type Definition (DTD) files. XML files are data, not software.
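A short sketch of reading custom XML tags with Python's standard xml.etree.ElementTree module is given here. The tag names library, book, and title are invented for the example, and no DTD is checked.

```python
import xml.etree.ElementTree as ET

# The tag names below are custom and purely illustrative.
document = """
<library>
  <book year="2012">
    <title>Computer Science Illuminated</title>
  </book>
</library>
""".strip()

root = ET.fromstring(document)
for book in root.findall("book"):
    # Attributes are read with get(), nested elements with find().
    print(book.get("year"), book.find("title").text)
```

The document carries only data; deciding what a book element means is left to the program that reads it.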

Extensible Stylesheet Language Transformations (XSLT)

XSLT is a language for transforming XML documents into other XML documents or other formats such as HTML. It creates a new document based on the content of the existing one.

JavaScript

JavaScript is a dynamic programming language widely utilized to create web resources. Characteristics include:

Client side
Supports object-oriented programming styles
The core language does not include input/output; this is provided by the host environment (e.g. the browser)
Can be used to embed images or documents, create dynamic forms, animation, slideshows, and validation for forms
Also used in games and applications

Web Pages

C.1.8 Outline the different components of a web page.

The head contains the title, meta tags, and other metadata. Metadata describes the document itself or associates it with related resources such as scripts and style sheets. The body contains headings, paragraphs, and other content.

The title defines the title shown in the browser’s toolbar.

Meta tags are snippets of text that describe a page’s content but don’t appear on the page itself, only in the page’s code. They help search engines find relevant websites.
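As an illustration of how a program might read the head of a page, the following sketch uses Python's standard html.parser module to collect the title and the meta tags. The class name HeadReader and the sample page are assumptions made for the example.

```python
from html.parser import HTMLParser


class HeadReader(HTMLParser):
    """Collect the <title> text and name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


page = """<html><head><title>Web Science</title>
<meta name="description" content="Notes on Option C">
</head><body><h1>Heading</h1><p>Paragraph</p></body></html>"""

reader = HeadReader()
reader.feed(page)
print(reader.title, reader.meta)
```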

C.1.10 Describe the different types of web page.

Personal pages are pages created by individuals for personal content rather than for affiliations with an organization. They are usually informative or entertaining, containing information on topics such as personal hobbies or opinions.

Blogs, or weblogs, are a mechanism for publishing periodic articles on a website.

Search engine results pages (SERPs) display the results returned by a search engine for a query.

Forums, or online discussion boards, are usually organized by topic, and people can hold conversations through posted messages. They typically have different user groups, which define a user’s roles and abilities.

C.1.11 Explain the differences between a static web page and a dynamic web page

Static web pages contain the same content on each load of the page, but dynamic web pages’ content can change depending on user input. Static websites are faster to develop and cheaper to develop, host, and maintain, but lack the functionality and easy ability to update that dynamic web sites have. Dynamic web pages include e-commerce systems and discussion boards.

Dynamic web pages can use PHP, ASP.NET frameworks, or JavaServer Pages (JSP) scriptlets. JSP scriptlets are small pieces of executable code intertwined with HTML. JSP is server-side; JavaScript, on the other hand, is client-side. The ASP.NET framework can use single-page applications (SPA) or MVC (Model-View-Controller) models to generate dynamic web pages or applications, and hosts a variety of .NET languages, such as C# with Razor syntax. PHP is a server-side scripting language for web development and can be embedded into HTML code or used with templates or frameworks.

C.1.13 Evaluate the use of client-side scripting and server-side scripting in web pages.

Server-side scripting runs on the server; it requires a request to be sent and data to be returned. It is more secure for the client. Examples include PHP, JSP, and ASP.NET.

Client-side scripting runs the script on the client’s side. It can pose a security risk to the client, but is faster. The main example is JavaScript, which often exchanges data with the server in JSON format.

C.1.14 Describe how web pages can be connected to underlying data sources.

A connection string is a string that specifies information about a data source and the means of connecting to it. It is commonly used for database connections.

C.1.15 Describe the function of the common gateway interface (CGI).

CGI is a standard way for web servers to interface with executable programs installed on a server that generate web pages dynamically.
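A minimal sketch of a CGI-style program in Python is shown here. Under CGI the server passes request details in environment variables such as QUERY_STRING and sends whatever the program prints back to the browser. The function name render and the name parameter are assumptions made for the example.

```python
import os
from html import escape
from urllib.parse import parse_qs


def render(environ) -> str:
    """Build the full CGI response (header, blank line, body) as text."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    # escape() keeps user input from being interpreted as HTML.
    name = escape(query.get("name", ["world"])[0])
    body = f"<html><body><p>Hello, {name}!</p></body></html>"
    return "Content-Type: text/html\r\n\r\n" + body


if __name__ == "__main__":
    # A web server would set QUERY_STRING before running the script.
    print(render(os.environ), end="")
```

Because the page is generated on each request, the same program can return different content for different query strings.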

Searching the Web

Layers of the Web

C.2.2 Distinguish between the surface web and the deep web.

The surface web is anything that can be found and accessed by search engines. The deep web includes web pages that cannot be found by search engines, for example because they are protected by authentication. These can usually only be accessed by already knowing the link or having the proper authentication. The dark web, on the other hand, can usually only be reached through Tor, as access requires encryption and anonymization.

Search Engines

C.2.1 Define the term search engine.

A web search engine is a site that helps you find other websites through methods such as keyword searching and concept-based searching. It discovers pages by following the links between websites.

Searching Algorithms

C.2.3 Outline the principles of searching algorithms used by search engines.

Searching means taking the query that has been entered and searching the index for matches. Several factors are taken into account: term frequency, zone indexes (placing different weight on, for example, the title versus the description), relevance feedback, and the vector model (looking at the cosine similarity between the query and a document).
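The vector model can be sketched by treating the query and each document as term-frequency vectors and comparing them by cosine similarity. This is a bare-bones illustration: the function name and sample texts are assumptions, and real engines add weighting and many other signals.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = "web crawler"
docs = ["a web crawler indexes the web", "lossy compression of images"]
for doc in docs:
    print(round(cosine_similarity(query, doc), 3), doc)
```

A document sharing no terms with the query scores 0, while one using the query terms often scores closer to 1.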

PageRank is a link analysis algorithm used by Google that assigns a numerical weighting to each element of a hyperlinked set of documents. The PageRank of a page E is written PR(E). A hyperlink to a page counts as a vote of support for that page, so importance comes by association. Each page passes on its own PageRank, divided by its number of outgoing links, to every page it links to, and a page's PageRank is built from the shares it receives in this way. Altogether, the PageRanks sum to 1: they form a probability distribution.
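The idea can be sketched as a simple iterative calculation on a tiny link graph. This is a textbook-style illustration, not Google's actual implementation. The damping factor of 0.85, the treatment of pages without outgoing links, and the example graph are assumptions not taken from the text above, and every linked page is assumed to appear as a key in the graph.

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Iteratively share rank along links; `links` maps page -> linked pages."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # Each outgoing link carries an equal share of this page's rank.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get), round(sum(ranks.values()), 6))
```

In this graph the page with the most incoming links ends up with the highest rank, and the ranks still add up to 1.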

The Hyperlink-Induced Topic Search (HITS) algorithm is a link analysis algorithm that also rates web pages, using the ideas of hubs and authorities. A good hub points to many pages; a good authority is a page linked to by many hubs. Each page is assigned two scores: its authority, which estimates the value of its content, and its hub value, which estimates the value of its links to other pages. The algorithm first generates a root set (the most relevant pages) through a text-based algorithm. A base set is then generated by augmenting the root set with the web pages that it links to or that link to it. The base set and all the hyperlinks in it form a focused subgraph on which HITS is performed.
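The hub and authority updates can be sketched on a hand-made base set. The graph, the page names, and the fixed number of iterations are assumptions for illustration, and building the root and base sets from a text search is left out.

```python
import math


def hits(links: dict[str, list[str]], iterations: int = 20):
    """Return (hub, authority) scores for the pages of a small link graph."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages linking to it.
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


base_set = {"hub1": ["news", "wiki"], "hub2": ["news"], "news": [], "wiki": []}
hub, auth = hits(base_set)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

Normalizing after each step keeps the scores from growing without bound, so only their relative sizes matter.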

Web Crawlers

C.2.4 Describe how a web crawler functions.

Web crawlers, also known as web spiders, are internet bots that systematically browse websites by following links while collecting information about each site. They also copy the pages for the index.

A bot, also known as a web robot, is a software application that runs automated tasks or scripts over the Internet, and can do so at a high rate. These are usually repetitive tasks.

Web crawlers can be told not to access a page with a robots.txt file, through the robots exclusion protocol.
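How a well-behaved crawler might consult robots.txt can be sketched with Python's standard urllib.robotparser module. The rules, the bot name ExampleBot, and the URLs are hypothetical; a real crawler would download the file from the site's /robots.txt path instead of using an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

allowed_public = parser.can_fetch("ExampleBot", "http://example.com/index")
allowed_private = parser.can_fetch("ExampleBot", "http://example.com/private/data")
print(allowed_public, allowed_private)
```

The file only states the site owner's wishes; it is up to each crawler to check it and comply.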

C.2.5 Discuss the relationship between data in a meta-tag and how it is accessed by a web crawler.

Meta tags are used for indexing keywords, for retrieval (if the index is relevant to the search query), and may sometimes be used for ranking. Google, for example, gives meta tags no weight. Students should be aware that this is not always a transitive relationship.

C.2.6 Discuss the use of parallel web crawling.

Crawling is the process of exploring every linked page and returning a copy of that page.^[3] Parallel crawling uses several web crawlers, or runs multiple processes in parallel, to maximize the download rate. Care has to be taken not to download the same page more than once.
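A rough sketch of parallel crawling with a shared record of visited pages follows, using a thread pool from Python's standard library. To stay self-contained it crawls a made-up in-memory link graph instead of the real web; WEB, fetch, and crawl are all hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "web": each URL maps to the links found on that page.
WEB = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/c"],
    "/c": ["/"],
}


def fetch(url: str) -> list[str]:
    """Stand-in for downloading a page and extracting its links."""
    return WEB.get(url, [])


def crawl(seed: str, workers: int = 4) -> set[str]:
    seen = {seed}
    frontier = [seed]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            # Fetch the whole frontier in parallel, one level at a time.
            results = pool.map(fetch, frontier)
            next_frontier = []
            for links in results:
                for link in links:
                    if link not in seen:  # never queue a page twice
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return seen


print(sorted(crawl("/")))
```

Keeping the visited set in one place is a simple way to avoid duplicate downloads; crawlers spread over many machines need some other way to coordinate.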

C.2.7 Outline the purpose of web-indexing in search engines.

Indexing is a process in which each page is analysed for words and then added to an index of websites.^[3] Indexing allows for speedy searching and high relevancy.^[3]
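One common way to organise such an index is an inverted index, which maps each word to the pages containing it. The sketch below assumes that structure, which the text above does not name, and uses made-up pages.

```python
from collections import defaultdict


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


pages = {
    "/crawlers": "web crawlers index pages",
    "/compression": "lossless compression keeps every bit",
}
index = build_index(pages)
print(sorted(index["compression"]))
```

Looking up a word is then a single dictionary access rather than a scan of every page, which is why indexing makes searching fast.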

C.2.9 Describe the different metrics used by search engines.

Trustworthiness of linking domain/hub
Popularity of linking page
Relevancy of content between source and target page
Anchor text used in link
Number of links to the same page on source page
Number of domains linking to target page
Relationship between source and target domains
Variations of anchor text in link to target page

Search Engine Optimization

C.2.8 Suggest how web developers can create pages that appear more prominently in search engine results.

Allow search engines to find your site
Have a link-worthy site
Identify key words, metadata
Ensure search-friendly architecture
Have quality content
Update content regularly

C.2.11 Discuss the use of white hat and black hat search engine optimization.

Black hat techniques are aggressive SEO strategies that exploit search engines rather than focusing on a human audience, for a short-term return. They include:

Blog spamming
Link farms
Hidden text
Keyword stuffing
Parasite hosting
Cloaking

White hat techniques are within guidelines and considered ethical, for a long-term return. They include:

Guest blogging
Link baiting
Quality content
Site optimization

Distributed Approaches to the Web

Future of an Interconnected Web

C.3.1 Define the terms: mobile computing, ubiquitous computing, peer-2-peer network, grid computing.

Mobile Computing

Mobile computing is human-computer interaction during which the computer can be expected to be transported during normal usage (or otherwise is mobile). Most popular devices include the smart phone and the tablet.

Ubiquitous computing

Ubiquitous computing is the concept of computing being made to appear anytime and anywhere: an overwhelming spread of computing, also called pervasive computing. It comes in different forms, e.g. laptops and tablets.

Peer-2-Peer Networks

Peer-2-Peer networks are ones in which each computer or node acts as both client and server, which allows resources to be shared by all within the network. Autonomy from central servers is achievable. An example of P2P is torrenting.

Grid Computing

Grid computing is the collection of computer resources in multiple locations to reach a common goal. It is distinguished from cluster computing in that grid computing assigns specific roles to each node. Grids can be used for software libraries. A grid is a persistent, standards-based service infrastructure.

C.3.2 Compare the major features of: the above

Ubiquitous computing is being driven forward by mobile computing: as mobile devices spread, computing becomes available in more places.

P2P is more about assuring connectivity and a network of shared resources, while grid computing focuses more on infrastructure. Both deal with the organization of resource sharing within virtual communities.

Ubiquitous computing is commonly characterized by multi-device interaction (as are P2P and grid computing), but the terms are not necessarily synonymous.

The grid in grid computing links together resources (PCs, workstations, servers, storage elements) and provides the mechanisms needed to access them.

Interoperability and Open Standards

C.3.3 Distinguish between interoperability and open standards.

Interoperability is the property of a system to work with other products or systems without any restrictions in access or implementation.

An open standard is a standard that is publicly available and has various rights to use associated with it.

C.3.4 Describe the range of hardware used by distributed networks.

Peer-to-peer: architectures where there are no special machines that provide a service or manage the network resources. Instead all responsibilities are uniformly divided among all machines, known as peers. Peers can serve both as clients and as servers.
Client–server: architectures where smart clients contact the server for data then format and display it to the users. Input at the client is committed back to the server when it represents a permanent change.
Three-tier: architectures that move the client intelligence to a middle tier so that stateless clients can be used. This simplifies application deployment. Most web applications are three-tier.

C.3.5 Explain why distributed systems may act as a catalyst to a greater decentralization of the web.

A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal. Because the work is spread across many computers that are all on the same level, no single computer acts as the 'boss', which encourages decentralization.

Compression

C.3.6 Distinguish between lossless and lossy compression.

Lossless compression recovers every single bit of the original data when decompressed (e.g. GIF).

Lossy compression permanently eliminates redundant or less important information (e.g. JPEG).
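The difference can be illustrated with a short sketch. The lossless part uses Python's standard zlib module. The lossy part is only a toy stand-in (rounding numbers to the nearest ten) and is not how JPEG works; the sample data is made up.

```python
import zlib

original = b"AAAAABBBBBCCCCC" * 100

# Lossless: decompressing returns exactly the bytes that went in.
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)
print(len(original), len(compressed), restored == original)

# Lossy (toy example): round samples to the nearest multiple of 10.
# The rounded values need less precision, but the originals are gone.
samples = [12, 17, 23, 48, 51]
quantized = [round(s / 10) * 10 for s in samples]
print(quantized)
```

Once the samples have been rounded there is no way to tell from the result what the original values were, which is the defining trade-off of lossy compression.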

C.3.7 Evaluate the use of decompression software in the transfer of information.

The original data can only be recovered exactly with lossless compression.
It is helpful if you do not have the original file.
With lossy compression, it might not bring every bit back and some minor details might be missing.

The Evolving Web

C.4.1 Discuss how the web has supported new methods of online interaction such as social networking.

Web 2.0 and the increase of dynamic web pages have allowed user contribution to proliferate and have led to the widespread use of social networking, blogging, and comment sections.

Cloud Computing

C.4.2 Describe how cloud computing is different from a client-server architecture.

Cloud computing is the use of remote servers hosted on the internet to store, manage, and process data, rather than a local server or personal computer. Resources are shared more widely in cloud computing than in the client-server architecture. Client-server architecture merely refers to the communication between client and server and the distribution of “responsibility”.

Public Cloud Computing

Anyone can access it.
Maintenance and updating are handled by the provider, not by the company using it.

Private Cloud Computing

A hosted data center where the data is protected by a firewall.
Great option for companies who have expensive data centers as they can use their current infrastructure.
However, maintenance and updating are up to the company.

Hybrid approach

Uses both private and public cloud.

C.4.3 Discuss the effects of the use of cloud computing for specified organizations.

Effects of use of cloud computing for organizations

Less costly
Device and location independence
Maintenance is easier
Performance is easily monitored
Security needs careful evaluation, since data is held on third-party servers

Intellectual Property

C.4.4 Discuss the management of issues such as copyright and intellectual property on the web.

Creative Commons licences give the freedom to share, adapt, and even commercially use material. Different licences carry different redistribution conditions; some may allow use without crediting, but the user may not claim the work as their own intellectual property.

C.4.5 Describe the interrelationship between privacy, identification and authentication.

Privacy: information shared with visiting sites, how that information is used, who that information is shared with, or if that information is used to track users. ^[4]

Identification: the process of comparing a data sample against all of the system's database reference templates in order to establish the identity of the person trying to gain access to the system. ^[5]

Authentication: a process in which the credentials provided are compared to those on file in a database of authorized users’ information on a local operating system or within an authentication server. ^[6]

Together, these three components enable safe and secure internet browsing.

C.4.6 Describe the role of network architecture, protocols and standards in the future development of the web.

Future Networks and Wireless Ad hoc Networks
Future Networks in Vehicular Ad Hoc Networks
5G and Internet of Things (IoT)
Future Internet applications in IoT
Steps towards Future of Smart Grid Communications
Routing in Machine to Machine (M2M) and Future Networks
Fusion of Future Networking Technologies and Big Data / Fog Computing
Future Internet and 5G architectural designs
5G advancements in VANETs (Vehicular Ad Hoc Network)
Mobile edge computing
Security and Privacy in future Networks
Networking Protocols for Future Networks
Data Forwarding in Future Networks
New Applications for Future Networks
Transport Layer advancements in Future Networks
Cloud based IoT architectures and use cases^[7]

Internet of Things (IoT)

IoT refers to the network of physical objects embedded with electronics and other needed technology to enable these objects to collect and exchange data. ^[8]

C.4.7 Explain why the web may be creating unregulated monopolies.

New multinational online oligarchies or monopolies may occur that are not restricted by one country.

Innovation can drop if there is a monopoly. There is therefore a danger of one social networking site, search engine, or browser creating a monopoly and limiting innovation.
Tim Berners-Lee describes today’s social networks as centralized silos, which hold all user information in one place.^[9]
Web browsers (Microsoft)
Cloud computing is dominated by Microsoft.
Facebook is dominating social networking.
ISPs may favor some content over others.
Mobile phone operators blocking competitor sites.
Censorship of content.^[10]

Net Neutrality

The principle that Internet Service Providers (ISPs) and governments should treat all data and resources on the Internet the same, without discrimination due to user, content, platform, or other characteristics.

C.4.8 Discuss the effects of a decentralized and democratic web.

The term 'Decentralized Web' is being used to refer to a series of technologies that replace or augment current communication protocols, networks, and services and distribute them in a way that is robust against single-actor control or censorship.^[11]

Benefits

More control over data: possible improved privacy
Making surveillance harder
Avoid censorship
Possibly faster speeds, e.g. BitTorrent

Issues

Barrier to usability: difficult for non-technical users to host their content
Less practical sometimes
DNS alternatives necessary for legible domain names: see BitTorrent links as an example
Higher maintenance costs

^[9]

[1] Dale, Nell, and John Lewis. Computer Science Illuminated. 5th ed. N.p.: Jones & Bartlett Learning, 2012. Print.
[2] International Baccalaureate Diploma Programme. Markscheme May 2015 Computer science Standard level Paper 2. N.p.: International Baccalaureate, May 2015. Print.
[3] International Baccalaureate Diploma Programme. Markscheme November 2015 Computer science Standard level Paper 2. N.p.: International Baccalaureate, May 2015. Print.
[4] Computer Hope. "What is privacy?" Computer Hope. Computer Hope, n.d. Web. 12 Apr. 2017.
[5] Webopedia. "Identification." What is identification? Webopedia Definition. Quinstreet Enterprise, n.d. Web. 12 Apr. 2017. <http://www.webopedia.com/TERM/I/identification.html>.
[6] Rouse, Margaret. "What is authentication? - Definition from WhatIs.com." SearchSecurity. TechTarget, Feb. 2015. Web. 12 Apr. 2017. <http://searchsecurity.techtarget.com/definition/authentication>.
[7] Pecht, Michael. "Future Networks: Architectures, Protocols, and Applications." IEEE Access. IEEE, 31 Jan. 2017. Web. 19 Apr. 2017. <http://ieeeaccess.ieee.org/special-sections-closed/future-networks-architectures-protocols-applications/>.
[8] "Internet of Things Global Standards Initiative". ITU. Retrieved 7 May 2016.
[9] CS-IB. "C.4 The evolving web." Cs-ib.net. CS-IB, n.d. Web. 19 Apr. 2017. <http://www.cs-ib.net/sections/C-04-the-evolving-web.html>.
[10] Web_science_option_c.docx. Huston: Weebly, n.d. PDF.
[11] Griffey, Jason. "What Is the Decentralized Web? 24 Experts Break it Down." Syracuse University School of Information Studies. Syracuse University School of Information Studies, 22 July 2016. Web. 19 Apr. 2017. <https://ischoolonline.syr.edu/blog/what-is-the-decentralized-web/>.
