Re: [CollEc] RePEc Visual

21 May 2020

I checked how fast the CollEc computations run when executed in C and C++ through an R package. The underlying graph contains 47,192 authors, each of whom has written at least one co-authored paper. I weighted the edges between co-authors by their number of joint papers.

First, I calculated the distance matrix. Distances are the lengths of the least-cost paths found by Dijkstra's algorithm. Calculating those 2,227,084,864 cell values (47,192 x 47,192) and writing them to disk took 4.77 minutes in a process parallelized across 8 cores. Computing each author's closeness value and writing it to disk took 4.27 minutes, again on 8 cores. Betweenness is quite slow in comparison.
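
For illustration, the distance and closeness steps described above can be sketched in a few lines of Python. This is not the R package's C/C++ code. The function names, the toy data, the conversion of joint-paper counts into edge costs (cost = 1 / number of joint papers) and the closeness formula (reachable authors divided by summed distance) are all assumptions made for this example.

```python
import heapq

def build_graph(joint_papers):
    """Build an undirected adjacency map from {(author_a, author_b): n_joint_papers}."""
    graph = {}
    for (a, b), n in joint_papers.items():
        cost = 1.0 / n  # assumption: more joint papers means a cheaper edge
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph

def dijkstra(graph, source):
    """Return the least-cost distance from source to every reachable author."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry, a cheaper path was already found
        for v, cost in graph[u].items():
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def closeness(graph, source):
    """Assumed definition: number of reachable authors over their summed distance."""
    dist = dijkstra(graph, source)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0

# Hypothetical toy data: three authors and their joint-paper counts.
toy = build_graph({("A", "B"): 2, ("B", "C"): 1, ("A", "C"): 4})
row_b = dijkstra(toy, "B")  # one row of the distance matrix
```

Each call to `dijkstra` yields one row of the distance matrix, which is why the work splits naturally across cores: rows are independent of each other.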

The code still leaves room for improvement. All three measures are derived from shortest cost paths, so it would be more efficient to derive those paths once and use them for all three measures rather than computing them three times. Another point is the structure of the parallel process. The chunk size of the iterations may not be optimal and could be tuned through further tests.
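
One possible way to share the shortest-path work is sketched below in Python: a single Dijkstra pass per source author records distances, shortest-path counts and predecessors, and the betweenness contribution is then accumulated from that same pass using Brandes' dependency accumulation. This is a swapped-in, well-known technique, not the code discussed in this mail. The function names, the closeness formula and the toy graph are assumptions, and the exact float comparison used to detect tied paths may miss ties when costs are not exactly representable.

```python
import heapq

def single_source_pass(graph, s):
    """One Dijkstra pass from s, keeping what all three measures need."""
    dist = {s: 0.0}
    sigma = {s: 1}    # number of least-cost paths from s to each node
    preds = {s: []}   # predecessors on least-cost paths
    order = []        # nodes in order of non-decreasing distance
    done = set()
    heap = [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        order.append(u)
        for v, cost in graph[u].items():
            nd = d + cost
            if v not in dist or nd < dist[v]:
                dist[v], sigma[v], preds[v] = nd, sigma[u], [u]
                heapq.heappush(heap, (nd, v))
            elif nd == dist[v] and v not in done:
                sigma[v] += sigma[u]  # another path of equal cost
                preds[v].append(u)
    return dist, sigma, preds, order

def all_measures(graph):
    """Distance rows, closeness and betweenness from one pass per source."""
    distance, closeness = {}, {}
    betweenness = {v: 0.0 for v in graph}
    for s in graph:
        dist, sigma, preds, order = single_source_pass(graph, s)
        distance[s] = dist
        total = sum(dist.values())
        # assumed closeness definition: reachable nodes over summed distance
        closeness[s] = (len(dist) - 1) / total if total > 0 else 0.0
        delta = {v: 0.0 for v in order}
        for w in reversed(order):  # farthest nodes first
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1.0 + delta[w])
            if w != s:
                betweenness[w] += delta[w]
    for v in betweenness:
        betweenness[v] /= 2.0  # undirected graph: each pair was counted twice
    return distance, closeness, betweenness

# Hypothetical toy graph: a path A - B - C with unit edge costs.
path = {"A": {"B": 1.0}, "B": {"A": 1.0, "C": 1.0}, "C": {"B": 1.0}}
distance, closeness, betweenness = all_measures(path)
```

Since each source is handled independently, the loop over `s` is also the natural unit to split into chunks for parallel workers.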

If users only access data for a small number of authors at once, it is not even necessary to precompute those values and store them on disk or in an SQL database. With the graph kept in memory, computations are near-instant for small sets of authors.
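
A minimal Python sketch of such on-demand computation follows, assuming the graph is held in memory as an adjacency map of edge costs. The function names and the toy graph are hypothetical. The search from each requested author stops as soon as all other requested authors are settled, so only part of the graph is visited.

```python
import heapq

def distances_to_targets(graph, source, targets):
    """Dijkstra from source that stops once every requested target is settled."""
    remaining = set(targets) - {source}
    dist = {source: 0.0}
    settled = set()
    heap = [(0.0, source)]
    while heap and remaining:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue
        settled.add(u)
        remaining.discard(u)
        for v, cost in graph[u].items():
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    # Targets that were never settled are unreachable from source.
    return {
        t: dist[t] if (t in settled or t == source) else float("inf")
        for t in targets
    }

def pairwise_distances(graph, authors):
    """Distances among a small set of authors, computed on request."""
    return {a: distances_to_targets(graph, a, authors) for a in authors}

# Hypothetical toy graph: a path A - B - C plus an isolated author D.
toy = {"A": {"B": 1.0}, "B": {"A": 1.0, "C": 1.0}, "C": {"B": 1.0}, "D": {}}
result = pairwise_distances(toy, ["A", "C", "D"])
```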

See you tomorrow.

Christian Düben
Research Associate
Chair of Macroeconomics
Hamburg University
Von-Melle-Park 5, Room 3102
20146 Hamburg
Germany
+49 40 42838 1898
christian.dueben@uni-hamburg.de
http://www.christian-dueben.com

-----Original Message-----
From: Thomas Krichel <krichel@openlib.org>
Sent: Wednesday, 20 May 2020 14:14
To: Düben, Christian <Christian.Dueben@uni-hamburg.de>
Cc: CollEc Run <collec-run@lists.openlib.org>
Subject: Re: RePEc Visual

  Düben, Christian writes
...
I went through some of the files and checked what I would need for an 
extension of CollEc. I have a few ideas in mind on what to add and how 
to present it in an interactive application.
It's very hard to do a worse job than I did visualizing that
  data!
...
When consulting our IT department here at Hamburg University, they 
suggested to host RePEc Visual on one of their managed Linux servers. 
At this point I am still waiting for the administration to process my 
application requesting such a server. And just like every 
administrative procedure at our institution, this takes a while. Once 
I have access to the respective infrastructure I am going to test 
implementations of RePEc Visual and potential CollEc extensions on it. 
Those applications would of course run under an external domain, not a 
Hamburg University domain.
We could run this on the existing CollEc server. This would
  be especially valuable if you manage to find a way to run the
  calculations faster. At this time, it's dreadfully slow. You could
  just take over the whole thing, well almost. We need to keep the
  mention of the sponsor, and I'd like to be acknowledged as the
  original creator.
...
I do not have Telegram and apparently do not have the correct login 
credentials for the Skype setup on my office Laptop. Do you use Zoom? 
If you do, I can send you a meeting link. If you do not, I will try to 
find out what login credentials our IT set for Skype.
Zoom should be fine. I'm in UTC+7. I can do late evenings no
  problem. My schedule is completely open. Maybe someone else would
  want to attend? I copy CollEc-run.

-- 

  Cheers,

  Thomas Krichel                  http://openlib.org/home/krichel
                                              skype:thomaskrichel