Lentis/Software Journalism: When Programs Write the News

From Wikibooks, open books for an open world

Software Journalism is the use of computer programs to write news. These programs take in data and produce human-readable stories. This chapter gives background on Software Journalism and explores the social interactions among Software Journalism software, news producers and news consumers.

Background

Software Journalism, sometimes called Automated Journalism or Robot Journalism, is the use of computer programs to automatically generate textual narratives from structured data. It is heavily tied to automating reporting.

Algorithms

Software Journalism applications use algorithms that quickly create many stories on a given topic. These algorithms are best at writing stories on repetitive topics for which clean, accurate and structured data is available. Organizations that want to cut costs and produce more news use Software Journalism algorithms.

Scale and Speed

Software Journalism algorithms generate news faster and at a larger scale than human journalists can. For example, Ken Schwencke developed Quakebot to automate earthquake reporting. The software detected a magnitude 4.4 earthquake using data from the United States Geological Survey. Quakebot published the story for the Los Angeles Times three minutes after the initial tremors, which was faster than all competing news outlets[1]. The LA Times later started a blog to give area residents homicide reports. The blog's software sifts through data from the coroner's office faster than human journalists and produces more in-depth reports than human journalists are capable of writing[2].

Personalization

Algorithms can use the same data to create many stories from different angles. Narrative Science used Software Journalism algorithms to produce various recaps of former University of Virginia pitcher Will Roberts' perfect game against George Washington University. Neutral and pro-GW stories were produced to illustrate this effect. These differing summaries were all produced from the game's box score[3].
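The idea of writing one data set from several angles can be sketched in a few lines of Python. This is a minimal illustration, not Narrative Science's method: the `TEMPLATES` table, the `recap` function, the angle names and the team names in the usage note are all hypothetical, and the box score is reduced to a mapping from team name to runs scored.

```python
# Hypothetical story templates, one per editorial angle.
TEMPLATES = {
    "neutral": "{winner} defeated {loser} {w_runs}-{l_runs}.",
    "winner": "{winner} rolled past {loser} in a {w_runs}-{l_runs} victory.",
    "loser": "{loser} fought hard but fell {l_runs}-{w_runs} to {winner}.",
}


def recap(box_score, angle="neutral"):
    """Render a one-sentence recap of a two-team game from the chosen angle.

    box_score maps each team name to its runs (an assumed, simplified schema).
    """
    # Sort the two teams by runs so the winner comes first.
    (winner, w_runs), (loser, l_runs) = sorted(
        box_score.items(), key=lambda item: item[1], reverse=True
    )
    return TEMPLATES[angle].format(
        winner=winner, loser=loser, w_runs=w_runs, l_runs=l_runs
    )
```

Calling `recap({"Hawks": 0, "Owls": 3})` and `recap({"Hawks": 0, "Owls": 3}, "loser")` yields two different sentences from identical data; only the template changes.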

Bias and Error

Software Journalism algorithms are not free from error and bias. A programmer's assumptions can cause algorithms to produce incorrect results. For example, the Associated Press automatically generated a story on Netflix's second quarter earnings using Wordsmith. The story incorrectly stated that Netflix's share price had fallen 71 percent in the year when it had actually doubled. This error occurred because Wordsmith's algorithm was unable to detect the seven-for-one stock split in Netflix's financial data. Netflix's shares dropped because of this algorithmic error and a correction was later issued by the AP[4]. Outliers, biased data and programmer bias can produce incorrect stories that may require critical corrections[5]. Algorithms are also limited in their ability to offer analytical insight; they can't ask questions or explain phenomena[6].
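The arithmetic behind a stock-split error of this kind can be shown with a short Python sketch. The prices used in the usage note are hypothetical round numbers chosen for illustration, not Netflix's actual share prices, and the function names are assumptions rather than part of Wordsmith.

```python
def percent_change(old_price, new_price):
    """Percentage change from old_price to new_price."""
    return (new_price - old_price) / old_price * 100


def split_adjusted_change(old_price, new_price, split_ratio):
    """Percentage change after restating the pre-split price in post-split shares.

    split_ratio is the number of new shares per old share
    (7 for a seven-for-one split).
    """
    return percent_change(old_price / split_ratio, new_price)
```

With a hypothetical pre-split price of 350 and a post-split price of 100, `percent_change(350, 100)` gives roughly -71 percent, while `split_adjusted_change(350, 100, 7)` gives +100 percent. The same two data points read as a collapse or as a doubling depending on whether the split is taken into account.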

NLG vs NLP

Software Journalism programs use Natural Language Generation (NLG) to create content. NLG is the counterpart of Natural Language Processing (NLP). NLP converts text into structured data, whereas NLG generates contextual narratives from data. Both NLP and NLG are fields of Artificial Intelligence [7].
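The two directions can be contrasted with a toy Python example. It is only a sketch: real NLP systems do far more than match one regular expression, and the sentence pattern, the `parse` and `generate` names and the place name in the usage note are all hypothetical.

```python
import re

# One fixed sentence shape, assumed for illustration only.
PATTERN = re.compile(
    r"A magnitude (?P<magnitude>\d+(?:\.\d+)?) earthquake "
    r"struck near (?P<place>[^.]+)\."
)


def parse(text):
    """NLP direction: turn a sentence into structured data (or None)."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return {"magnitude": float(match["magnitude"]), "place": match["place"]}


def generate(record):
    """NLG direction: turn structured data into a sentence."""
    return (
        f"A magnitude {record['magnitude']} earthquake "
        f"struck near {record['place']}."
    )
```

Parsing "A magnitude 4.4 earthquake struck near Springfield." produces a record with a magnitude and a place, and passing that record to `generate` reproduces the sentence.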

Why Use Software Journalism?

Software Journalism allows content producers to quickly identify facts that are important to the narrative through seamless data source integration. A story can be generated automatically from a dataset of any size. The narrative produced is almost indistinguishable from one written by a human journalist and can be personalized. Employee productivity also increases, since employees no longer need to do manual reporting and time-consuming data analysis. This allows employees to focus on personal growth and higher-level content generation. Together, these benefits allow content-producing organizations to generate more narratives while cutting production costs[8].

How to Use Software Journalism

Data is needed before content can be generated. Software can use pre-defined sources or mine text for data to fulfill this requirement. Clean and accurate data must be used; otherwise errors may occur. Algorithms apply statistical methods to the available data to identify interesting events. Users typically give algorithms identification rules for finding such events. These rules are also used to rank events by how insightful they are. Story and style templates are then used to generate a narrative from the most newsworthy events. Content publishers can review automatically generated stories before releasing them to the public[9].
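The steps above (identify events with a rule, rank them, then fill a template) can be sketched in Python. This is a minimal illustration under stated assumptions, not the Wordsmith or Quill API: the `Event` schema, the function names, the 5 percent threshold and the company names in the usage note are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One company's quarterly result (hypothetical schema)."""
    company: str
    eps_actual: float    # reported earnings per share
    eps_expected: float  # analyst forecast; assumed to be non-zero


def surprise(event):
    """Relative gap between reported and expected earnings per share."""
    return (event.eps_actual - event.eps_expected) / abs(event.eps_expected)


def select_newsworthy(events, threshold=0.05):
    """Identification rule: keep events whose surprise meets the threshold,
    ranked so the largest surprise comes first."""
    hits = [e for e in events if abs(surprise(e)) >= threshold]
    return sorted(hits, key=lambda e: abs(surprise(e)), reverse=True)


def render(event):
    """Fill a story template; the verb depends on the direction of the surprise."""
    verb = "beat" if surprise(event) > 0 else "missed"
    return (
        f"{event.company} {verb} analyst expectations, reporting earnings of "
        f"${event.eps_actual:.2f} per share against a forecast of "
        f"${event.eps_expected:.2f}."
    )


def generate_stories(events, threshold=0.05):
    """Run the whole pipeline: select, rank, then render each event."""
    return [render(e) for e in select_newsworthy(events, threshold)]
```

Given `Event("Acme", 1.20, 1.00)`, `Event("Globex", 0.50, 0.51)` and `Event("Initech", 0.70, 1.00)`, the Globex result is dropped as unremarkable and the Initech story is written first because its surprise is the largest. A publisher could review the returned list before releasing any of it.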

Use Cases

Software Journalism has a variety of uses beyond writing the news. These include:

  • E-Commerce: Companies can use manufacturer data to create targeted and compelling product descriptions for customers [10].
  • Media: The AP uses Software Journalism to produce 3700 earnings stories per quarter; this is 12 times the amount produced by manual reporting[11].
  • Financial Services: Banks and investment management companies rely on automated content generation to create portfolio summaries, earnings recaps and market reports [12].
  • Real Estate: Property descriptions, market trends and neighborhood summaries can be crafted from real estate data[13].
  • Customer Engagement: The Orlando Magic automatically generates content to engage season ticket holders who are reselling seats[14].

Participants

NLG Companies

Software from a few companies produces the vast majority of today's NLG content. The biggest players in the industry are Automated Insights, which makes Wordsmith, and Narrative Science, which develops Quill. Automated Insights generated 300 million articles in 2013 and 1 billion articles in 2014. This amounts to more than all combined content produced by major media outlets in 2013[15]. Automated Insights released a public version of Wordsmith in 2015.

Amount of Content Generated

The AP automatically generates at least 3000 earnings reports per quarter through Wordsmith. The AP also uses Wordsmith to recap over 9000 Minor League Baseball games each year[16]. Automated Insights estimates that Wordsmith generated 1.5 billion articles in 2015 alone[17] and that Wordsmith can create 2000 articles per second if the need arises [18].

Companies Using Software Journalism

A wide variety of companies use NLG software.

Software Journalism is used to scale up content production; more content can be made at a lower cost using Software Journalism. Companies value cutting cost while creating more media faster to engage a specific audience.

Media Consumers

Media consumers participate voluntarily and check facts. They seek out information willingly and can stop consuming media at any time. These individuals want accurate content from a credible source. Content can be consumed for personal gain and enjoyment.

Reader Perception of Automatically Generated Content

A 2014 study investigated how readers perceive software-generated content in relation to content written by a human. Participants in this study were shown various texts and asked to rank them based on criteria like objectivity, clarity and credibility. Text claimed to have been written by a human journalist scored higher on coherence, readability and clarity while software-generated text scored higher on accuracy, information delivered, descriptiveness, trustworthiness and objectivity. The study states that these observed differences are not statistically significant. Therefore, readers perceive software-generated text in the same light as text written by humans [23].

Another study examined how readers perceived content differences using articles written on the same topic by computers and by humans. The study's results show that articles declared as human-written were ranked more favorably regardless of the actual author type. Similarly, articles declared as computer-written were ranked less favorably. This study contends that a reader's preconceived notions about computer-generated content will affect the content's perceived quality. In other words, readers are not able to discern actual differences between computer-written and human-written content[24].

Social Implications

Credibility of Information

It can be very hard for an audience to determine whether an article was written by a human or an algorithm, and articles are not usually marked. An online quiz published by the New York Times revealed that readers could correctly determine an article's source only about 50% of the time[25].

Articles can be subject to error regardless of author type. NLG software is subject to three main sources of error:

  • Error propagation due to bias of the NLG software developer.
  • Error in data entered into the software template.
  • Error due to data stream corruption (hacking).

These errors can reach the public more often than human-made errors because articles are published faster than human quality control can review them. There are instances where Software Journalism has produced critical errors in content.

On July 23rd 2014 at 9:50 AM EST, the AP tweeted, “Breaking: Dutch military plane carrying the bodies of the Malaysia airlines flight 17 crash lands in Eindhoven.” Nine minutes later, the AP issued a clarification reading, “Clarifies: Dutch military plane carrying the bodies of the Malaysia airlines flight 17 lands in Eindhoven.” In the nine minutes it took the AP to broadcast the clarification, 3818 users had "retweeted" the false information[26]. On October 6th 2015, the AP's managing editor, Lou Ferrara, stated that the false tweet was "unintended especially on such a horrible situation" in an interview conducted by Hasan Minhaj, a senior correspondent of The Daily Show[27]. This error was attributed to Software Journalism's inability to properly deconstruct information.

On March 16th 2015, the AP published an article stating that Robert Durst had been arrested on weapons charges in Louisiana and a first degree murder charge in Los Angeles. The article correctly identified Robert Durst as the individual facing criminal charges. However, it described him using details that belong to Fred Durst, the lead singer of the band Limp Bizkit. One day after the incorrect publication, the AP released the following statement: “The Associated Press reported erroneously that Robert Durst is a member of a band. He is a real estate heir; Fred Durst is the former frontman of Limp Bizkit”[28].

These examples show how algorithmic error can cause a rapid spread of false information.

Thomas Theorem & Perpetuation of Iron Triangles in Industry

Figure: Iron Triangle

The Thomas Theorem holds that if people define situations as real, they are real in their consequences. False information that is believed can therefore cause unwarranted actions. NLG software’s high publication rate can be used by groups to drive a false narrative or spread biased information. Kristian Hammond, a Narrative Science cofounder, estimated that more than 90% of news stories will be written by software by the year 2027[29]. Spreading false or biased information at this scale allows NLG software to drown out opposing viewpoints. Therefore, this software can be a tool that groups use to perpetuate an iron triangle.

An Iron Triangle is a self-reinforcing social power structure. The common iron triangle is formed between government, interest groups, and bureaucracy. It has the power to shape public opinion, elect sympathetic officials, and control the focus of research to enhance the standing of a private interest group. Software Journalism can be used to perpetuate this power structure by enabling the rapid dissemination of massive quantities of data and news. Combined with the pervasive nature of today's media, Software Journalism can endlessly bombard media consumers with one-sided viewpoints, statistics or ideologies. The sheer volume of this material can look like a public consensus to media consumers when, in reality, it may be an artificially amplified view produced by NLG software. For example, a campaign could use Software Journalism to misinform the electorate about rival candidates. Pervasive, biased information can sway public opinion and cause a qualified candidate to lose an election.

Generalizations and Future Research

Generalizations

Like any technology, automated journalism affords its users new power. As discussed, the ability to spread vast quantities of media content can heavily influence public opinion. NLG software has no inherent ulterior motive. In the wrong hands, however, it could spread false information, propaganda, or anything else its users desire. Information can be used to educate and inform the public, but it can also be used to control, direct, or mislead them. This directly relates to a technology's latent and manifest functions.

NLG software is not robust to error. A specialized technology, like that used in Software Journalism, is not versatile. Such technology cannot correct errors efficiently without human intervention. This phenomenon typically appears with automation technology. For example, mechanized assembly lines can't fix all errors and require human quality control. Therefore, human oversight will always be needed with automation.

Future Research

Future researchers may look at how Software Journalism affects non-textual media like advertising, since journalism doesn't just involve textual narratives. Another area of research would be human journalists' reactions to, and displacement by, Software Journalism. Software Journalism has changed how news is produced, so it would be worthwhile to explore the human journalist's changed role. Automation is a big part of Software Journalism. Researchers could explore the historical perception of automation to better understand Software Journalism's social interface.

References

  1. Pluscina, J. (2014, March 18). How an algorithm helped the LAT scoop Monday's quake. http://www.cjr.org/united_states_project/how_an_algorithm_helped_the_lat_scoop_mondays_quake.php
  2. The Los Angeles Times (2016). Frequently Asked Questions. http://homicide.latimes.com/about/
  3. Petchesky, B. (2011, March 30). We Heard From The Robot, And It Wrote A Better Story About That Perfect Game. http://deadspin.com/5787397/we-heard-from-the-robot-and-it-wrote-a-better-story-about-that-perfect-game
  4. Associated Press. (2015, July 15). Netflix misses Street 2Q forecasts. http://finance.yahoo.com/news/netflix-misses-street-2q-forecasts-202216117.html
  5. Diakopoulos, N. (2016). Accountability in Algorithmic Decision Making: A View from Computational Journalism. Communications of the ACM. http://towcenter.org/wp-content/uploads/2014/02/78524_Tow-Center-Report-WEB-1.pdf
  6. Graefe, A. (2016, January 7). Guide to Automated Journalism. http://towcenter.org/research/guide-to-automated-journalism/
  7. Wright, A. doi:10.1145/2820421
  8. Narrative Science. (2016). Quill. https://www.narrativescience.com/quill
  9. Automated Insights. (2016). The Complete Getting Started Guide. https://wordsmithhelp.readme.io/docs/getting-started
  10. Automated Insights. (2016, July). Automating E-Commerce Content Creation. http://go.automatedinsights.com/rs/671-OLN-225/images/E-Commerce-Whitepaper-Ai.pdf
  11. Automated Insights. (2016). The Associated Press Leaps Forward. https://automatedinsights.com/associated-press-leaps-forward
  12. Automated Insights. (2016). Wordsmith Use Cases. https://automatedinsights.com/use-cases
  13. Automated Insights. (2016). Wordsmith Use Cases. https://automatedinsights.com/use-cases
  14. Automated Insights. (2016). Customer Data Makes Orlando Magic. https://automatedinsights.com/orlando-magic-case-study
  15. Automated Insights. (2016). Automated Insights. https://automatedinsights.com/
  16. Kotecki, J. (2016, August 15). Just How Good Can Wordsmith Content Really Be?. https://automatedinsights.com/blog/just-good-can-wordsmith-content-really
  17. Automated Insights. (2016). Automated Insights. https://automatedinsights.com/
  18. Miller, R. (2015, January 29). AP's 'robot journalists' are writing their own stories now. http://www.theverge.com/2015/1/29/7939067/ap-journalism-automation-robots-financial-reporting
  19. Automated Insights. (2016). Automated Insights. https://automatedinsights.com/
  20. Automated Insights. (2016). Bodybuilding.com's Automated Trainer. https://automatedinsights.com/bodybuilding-com-case-study
  21. Automated Insights. (2016). Customer Data Makes Orlando Magic. https://automatedinsights.com/orlando-magic-case-study
  22. Narrative Science (2016). Narrative Science. https://www.narrativescience.com/
  23. Clerwall, C. (2014, February 24). Enter the Robot Journalist: Users' perception of automated content. Journalism Practice, 8(5), 519 - 531.
  24. Graefe, A., Haim, H., Haarman, B., & Brosius, H. (2016, April 17). Perception of Automated Computer-Generated News: Credibility, Expertise, and Readability. doi:10.1177/1464884916641269
  25. New York Times. (2015, March 8). Did a human or computer write this? http://www.nytimes.com/interactive/2015/03/08/opinion/sunday/algorithm-human-quiz.htm
  26. RT News. (2014, July 23). Tweet Gone Wrong. https://www.rt.com/usa/175056-twitter-ap-mh17-victims/
  27. Sorkin, A. (2015, October 6). Robot Journalists. http://www.cc.com/video-clips/fh76l0/the-daily-show-with-trevor-noah-robot-journalists
  28. Goldstein, S. (2015, March 17). Accused killer Robert Durst misidentified in AP story as ‘former Limp Bizkit frontman’ Fred Durst. http://www.nydailynews.com/news/national/robert-durst-mixed-story-fred-durst-limp-bizkit-article-1.2152410
  29. Levy, S. (2012, April 4). Can an algorithm write a better news story than a human reporter? https://www.wired.com/2012/04/can-an-algorithm-write-a-better-news-story-than-a-human-reporter/