Web technologies to improve historical research, Financial Accounting notes

Subject: humanities, Professor: , Degree: Finance and Accounting, University: UC3M

Type: Notes

2017/2018

Uploaded on 24/05/2018

felicity11

Web technologies to improve historical research
1st Session
HUMANITIES COURSES 2nd ed. 2017/18

Ask questions
• Items from Europeana, the CU WW1 Collection and Out of the Trenches relating to events that happened in West Flanders
• Population change in Belgian provinces during the war years as compared to the number of atrocities as well as total events that occurred there
These questions were suggested to show the usefulness of the Operation War Diary and Out of the Trenches for the WW1 centenary commemoration (http://www.ldf.fi/project.html)

Who was Francisco Sanchez “el escéptico”? (http://www.larramendi.es/francisco_sanchez/es/micrositios/inicio.do)

Digital Humanities Life Cycle
A similar trend could be seen in Digital Humanities (DH)
1. Objectives & Planning
2. Acquisition (OCR...): recording, extraction
3. Cleaning
4. Enrichment (merging and LOD)
5. Aggregation (i.e. KOS)
6. Analysis & Interpretation
7. Publication

Humanities Life Cycle
• Acquisition (creation): digitization of documents, scraping documents or external datasets
• Cleaning: avoid misspellings and errors
• Enrichment: data are augmented with metadata and external datasets (coherence with the historic period and context)
• Editing: historians add annotations, bibliographic references and links
• Retrieval: information is selected, looked up, and used (e.g. via SQL or XPath)
• Analysis: qualitative and quantitative analysis
• Publication: historical information is communicated through multiple forms of presentation (e.g. text editions, online databases, virtual exhibitions, visualizations)
A well-known cycle for the humanities, but essentially the same (just slight differences)

Web technologies to improve historical research
1.2. Acquisition
A HUMANITIES COURSE 1st ed. 2016/17

Data formats.
Serializations

Format: Spreadsheets (Excel, …)

Format: TSV (tab-separated values). Example (columns separated by tabs):
    NAME    NATIONALITY    WEIGHT
    Alan    Spanish        55
    John    French         129

Format: CSV (comma-separated values). Example:
    NAME,NATIONALITY,WEIGHT
    Alan,Spanish,55
    John,French,129

Format: XML. Example:
    <example>
      <person ID="1">
        <name>Alan</name>
        <nationality>Spanish</nationality>
        <weight>55</weight>
      </person>
      <person ID="2"><name>John<….></person>
    </example>

Format: JSON. Example:
    {"example": [
      {"name": "Alan",
       "nationality": "Spanish",
       "phone": {"work_ph": "25255", "cell_ph": "45433"},
       "weight": 51},
      {other_record}
    ]}

Format: JSON-based. BSON (Binary JSON) is a binary format that is more efficient to process than JSON text. BSON includes data types (string, integer, double, date, array or boolean), the document size, and the field length for large elements. Other serializations related to JSON are HOCON, Candle, Smile and YAML.

Format: YAML. Example:
    Data:
      given: Alan
      nationality: Spanish
      weight: 51.5
      age: 26
      Phone:
        - Work: 25255
          Address: 8 St.Paul Av. Quebec
        - cellular: 45433

Data acquisition. Bias
Boston’s Street Bump
• Uses the phone’s accelerometer to detect potholes without the need for city workers to patrol the streets

Acquisition tools
• The best-known tools for collecting data from the Internet are the so-called spiders (crawlers)
• A free tool is Wget (GNU license); an easy-to-use alternative is PowerGREP (http://wget-how-to-by-joebeng.blogspot.com.es/2011/09/wget-guickstart-tutorial-for-windows.html)
• Try importing tables from the web with Google Spreadsheets.
▫ In a cell, insert the following function: =IMPORTHTML(URL, query, index)
  Example: =IMPORTHTML("https://en.wikipedia.org/wiki/List_of_metropolitan_areas_by_population","table",1)

Data acquisition: Web Scraping
• Also known as Web Scraper, Web Harvesting or Web Data Extraction
• A tool to extract structured data from websites with unstructured data
• Process
▫ Visit the page/site
▫ Select the data you want to download
▫ Get the data with the tool
• E.g.: camelcamelcamel, which tracks price fluctuations on Amazon

Web Scraping: tools
• Easy to use, little or no programming knowledge needed
▫ Import.io (https://www.import.io)
▫ Octoparse (http://www.octoparse.com)
▫ Screen Scraper (www.screen-scraper.com)
▫ Mozenda (http://www.mozenda.com/)
▫ Web Scraper (http://webscraper.io/)
▫ ParseHub (https://www.parsehub.com/)
▫ Portia (https://scrapinghub.com/portia/)
▫ DataScraping.co (https://www.datascraping.co/)
• Browser plug-ins
▫ Scraper
▫ Web Scraper
▫ Extracty
• Frameworks, with programming capabilities
▫ Scrapy (https://scrapy.org/), Python
▫ Jsoup (https://jsoup.org/), Java

Data Acquisition. SPARQL on DBpedia
• DBpedia is a community effort to extract structured information from Wikipedia (infoboxes)
• It is interlinked with many datasets, like GeoNames
• Different datasets are merged by means of the SKOS vocabulary and OWL (sameAs) elements. Different groups could link datasets in a different way.
• It uses three schemata to classify concepts: Wikipedia categories, the YAGO classification and WordNet

Data hubs, datasets and SPARQL
• A dataset is a collection of related data (like a table in a relational database), published and maintained by a single provider, produced in the same context and formatted in a similar way.
• Datasets are stored in repositories called Data Hubs (e.g. CKAN). A data hub shares different and independent datasets in distinct formats (csv, xls, db, rdf, …).
• Datasets are often described with metadata and RDF.
• SPARQL provides a way to query datasets modeled in RDF through an interface.
These interfaces are called SPARQL endpoints.

A query in SPARQL
• Examine this document in RDF
▫ https://cs.uwaterloo.ca/~ohartig/files/eswc2012.rdf
▫ Now copy the SPARQL query below and paste it into http://www.sparql.org/sparql.html. Set the URI field to the link above.

    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    SELECT ?book ?title
    WHERE { ?book dc:title ?title }

SPARQL Query Language for RDF, W3C Recommendation, 15 January 2008

Retrieving data from DBpedia
Query:
    SELECT ?entity
    WHERE {
      ?entity rdf:type dbo:Country .
      ?entity dct:subject dbc:Member_states_of_the_United_Nations .
    }
Triples:
    <?entity> <rdf:type> <dbo:Country>
    <?entity> <dct:subject> <dbc:Member_states_of_the_United_Nations>
Most of the entities typed as dbo:Country do not exist any more. To show only countries that exist nowadays, add the condition that the entity belongs to the United Nations (some countries, like Yugoslavia, could still be on the list).

Retrieving data from DBpedia
Now we want to show how many people live in these countries and rename the column entity to country.
Query:
    SELECT (?entity AS ?COUNTRY) ?population
    WHERE {
      ?entity rdf:type dbo:Country .
      ?entity dct:subject dbc:Member_states_of_the_United_Nations .
      ?entity dbo:populationTotal ?population
    }
Triples:
    <?entity> <rdf:type> <dbo:Country>
    <?entity> <dct:subject> <dbc:Member_states_of_the_United_Nations>
    <?entity> <dbo:populationTotal> <?population>
But Spain is not on the list. Why?

Retrieving data from DBpedia
Now we want to show how many people live in these countries and rename the column entity to country, including Spain.
Query:
    SELECT (?entity AS ?COUNTRY) ?population
    WHERE {
      ?entity rdf:type dbo:Country .
      ?entity dct:subject dbc:Member_states_of_the_United_Nations .
      { ?entity dbo:populationTotal ?population }
      UNION
      { ?entity dbp:populationCensus ?population }
    }
Triples:
    <?entity> <rdf:type> <dbo:Country>
    <?entity> <dct:subject> <dbc:Member_states_of_the_United_Nations>
    <?entity> <dbo:populationTotal> <?population>
    OR
    <?entity> <dbp:populationCensus> <?population>
If we Google “dbpedia Spain” we can see that, for Spain, this information is under the property dbp:populationCensus, not dbo:populationTotal.

Select syntax

    PREFIX abbreviation: <http://namespace>
    PREFIX ...
    SELECT ?data1 ?data2
    FROM <uri>
    WHERE { tripletRDF . FILTER filter }
    ORDER BY data
    LIMIT number

• PREFIX: the namespaces and qnames that you are going to use (some are defined by default).
• SELECT (mandatory): the variables to be shown. To show all of them, write *. The names are invented, but give them clear semantics, because they are shown in the column headers.
• FROM (optional): only when the SPARQL endpoint accepts different datasets.
• WHERE: the triples show how statements are linked to each other. End each triple with a period; when the next triple has the same subject, use a semicolon instead. All the selected variables must appear in these triples.
• FILTER: after a triple, the result set can be limited with a filter. For instance, ?x > 5000.
• ORDER BY: orders the results in ascending (ASC) or descending (DESC) order by one or more variables.
• LIMIT: the number of solutions returned.

SELECT, modifiers and aggregate functions
    SELECT DISTINCT * WHERE {…
No duplicates are shown.
    SELECT (MAX(?x) AS ?nuevo) WHERE {…
Selects the maximum (MAX) or the minimum (MIN) value of ?x.
    SELECT (COUNT(?x) AS ?cuenta) WHERE {…
Returns the number of matches found for ?x.
    SELECT (SUM(?x) AS ?cuenta) WHERE {…
Sums the numbers in this variable.
    SELECT (AVG(?x) AS ?cuenta) WHERE {…
Returns the average of the variable ?x.
    SELECT … WHERE { … } GROUP BY ?X
When an aggregate function (COUNT, SUM, AVG, MIN, MAX) is selected together with other variables, it is necessary to add GROUP BY at the end, listing those non-aggregated variables.

Where
    WHERE { ?X rdf:type ?Y }
Equivalent to:
    WHERE { ?X a ?Y }

    {?X dc:title ?A} UNION {?X dc:creator ?B}
Shows the ?X that have a title, together with the ?X that have a creator (author).

    WHERE { ?X rdf:type ?Y . OPTIONAL { ?X foaf:name ?Z } }
Shows the match even when { ?X foaf:name ?Z } returns an empty set.

Query examples
    SELECT ?king ?country
    WHERE {
      ?country dct:subject dbc:Member_states_of_the_United_Nations .
      ?country dbo:leaderTitle ?king .
      FILTER (regex(?king, '^monarch', 'i'))
    }
Monarchies in the United Nations (copy the query into http://dbpedia.org/sparql).
It is advisable to look up the resource first, to learn the model and the metadata used: http://dbpedia.org/page/Spain

Query examples
    SELECT ?ruler (COUNT(?country) AS ?aggregated)
    WHERE {
      ?country dct:subject dbc:Member_states_of_the_United_Nations .
      ?country dbo:leaderTitle ?ruler .
    }
    GROUP BY ?ruler
    ORDER BY DESC(?aggregated)
This query counts the member states of the United Nations by type of head of state. Run it at http://dbpedia.org/sparql to see the output.

Exercise 22/03/2018
Can you find, by means of DBpedia queries, any indication that a system of government based on a single party is more harmful to the majority of citizens than one based on a constitutional parliamentary government?
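
The query examples above are meant to be pasted into the web form at http://dbpedia.org/sparql. As a complement, here is a minimal Python sketch of how the same kind of query could be sent from a script. It is not part of the course material. It assumes that the DBpedia endpoint accepts the standard SPARQL protocol (a `query` parameter in a GET request) and returns the standard SPARQL JSON results format when asked through the `Accept` header. The function names `build_request`, `rows_from_results` and `run_query` are hypothetical helpers invented for this illustration, and the prefixes are written out explicitly instead of relying on the endpoint's defaults.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public DBpedia endpoint mentioned in the slides.
ENDPOINT = "https://dbpedia.org/sparql"

# The head-of-state count query in standard SPARQL 1.1 syntax,
# with its prefixes declared explicitly.
QUERY = """
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dbc: <http://dbpedia.org/resource/Category:>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?ruler (COUNT(?country) AS ?aggregated)
WHERE {
  ?country dct:subject dbc:Member_states_of_the_United_Nations .
  ?country dbo:leaderTitle ?ruler .
}
GROUP BY ?ruler
ORDER BY DESC(?aggregated)
"""


def build_request(query, endpoint=ENDPOINT):
    """Build a GET request that passes the query in the 'query' parameter."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)


def rows_from_results(document):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    variables = document["head"]["vars"]
    rows = []
    for binding in document["results"]["bindings"]:
        # A variable left unbound (e.g. by OPTIONAL) is absent from the binding.
        rows.append({v: binding[v]["value"] if v in binding else None
                     for v in variables})
    return rows


def run_query(query, endpoint=ENDPOINT):
    """Send the query and return the result rows (needs network access)."""
    request = build_request(query, endpoint)
    with urllib.request.urlopen(request, timeout=30) as response:
        return rows_from_results(json.load(response))


if __name__ == "__main__":
    # Print the ten most frequent head-of-state titles with their counts.
    for row in run_query(QUERY)[:10]:
        print(row["aggregated"], row["ruler"])
```

Keeping the parsing step in its own function means the result handling can be checked on a saved response without contacting the endpoint, which is convenient when the public service is slow or unavailable.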