UTwente PeoplePages Contact Book


As a future Creative Technology student at the University of Twente, I wanted to get in touch with a particular professor. The university's staff directory, PeoplePages, uses a RESTful API for AJAX requests to search for university staff, so I decided to add everyone to my contacts using API scraping to save time in the future.

I ran a query for all results starting with the letter “a” and got a minified JSON response containing all the data. Fortunately, access to the endpoints was unrestricted. This is what the response looks like when pretty-printed:

GET https://people.utwente.nl/search?query=a
{
  "data": [
    {
      "type": "person",
      "id": "10000000000XXXX",
      "name": "John Doe",
      "jobtitle": "Supporting Staff",
      "avatar": "https://people.utwente.nl/john.doe/picture.jpg",
      "profile": "https://people.utwente.nl/john.doe",
      "organizations": [
        {
          "code": "S&B-XXXX",
          "department": "S&B",
          "section": "XXXX"
        }
      ],
      "locations": [
        {
          "description": "Enschede 320",
          "latitude": 52.23979,
          "longitude": 6.850018
        }
      ],
      "phones": [
        {
          "type": "",
          "tel": "+3153489XXXX",
          "prefix": "+3153489",
          "ext": "XXXX"
        }
      ],
      "email": "john.doe@utwente.nl"
    }
  ]
}

and so on. Since empty searches, single-space searches, and similar tricks didn't work, I decided to query each letter of the alphabet and save the JSON results to play with:

wget https://people.utwente.nl/search?query={a..z}

I soon realized that this wouldn't work because the API caps each response at 50 results, but this would:

wget https://people.utwente.nl/search?query={a..z}{a..z}
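The brace expansion above produces 26 × 26 = 676 requests. A rough Python sketch of the same enumeration follows. The URL is the one from the post; the helper names and the `looks_capped` check (flagging responses that hit the 50-result limit and may therefore be truncated) are illustrative assumptions, not part of the original workflow.

```python
from itertools import product
from string import ascii_lowercase

# Endpoint taken from the post; everything else here is illustrative.
BASE_URL = "https://people.utwente.nl/search?query="
RESULT_CAP = 50  # per-response limit reported in the post


def two_letter_urls():
    """Return one search URL per two-letter query: aa, ab, ..., zz."""
    return [BASE_URL + a + b for a, b in product(ascii_lowercase, repeat=2)]


def looks_capped(payload, cap=RESULT_CAP):
    """True if a response holds `cap` or more results and may be truncated."""
    return len(payload.get("data", [])) >= cap


urls = two_letter_urls()
print(len(urls), urls[0], urls[-1])
```

Any two-letter query that still looks capped could be split further into three-letter queries in the same way.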

This goes through every two-letter combination of the alphabet (aa, ab, ac, ..., zx, zy, zz) and downloads the JSON response for each. This was enough, but many combinations, like xx or xz, returned no results, so the resulting file was exactly 43 bytes and contained only the empty JSON structure. I then got rid of those files:

find . -name "*" -size 43c -delete

This command finds all files that are exactly 43 bytes in size and deletes them. Note that a bare number, as in -size 43 -delete, is counted in 512-byte blocks rather than bytes, so POSIX find requires the “c” suffix to match a size in bytes.
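For comparison, here is a minimal Python sketch of the same cleanup. It assumes the downloaded files sit directly in one directory (unlike find, it does not recurse), and the function name is made up for illustration.

```python
from pathlib import Path


def delete_files_of_size(directory, size_bytes):
    """Delete regular files directly inside `directory` that are exactly
    `size_bytes` bytes long, and return their names in sorted order."""
    removed = []
    for path in Path(directory).iterdir():
        if path.is_file() and path.stat().st_size == size_bytes:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)


# Example (destructive, so only run it in the download directory):
# delete_files_of_size(".", 43)
```

Comparing `st_size` directly avoids the block-versus-byte ambiguity altogether, since it is always reported in bytes.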

Finally, I concatenated all the JSON files into one giant 4.9 MB file:

cat * > contacts.json
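Concatenating with cat places the JSON objects back to back, which is not a single valid JSON document, so some cleanup is needed afterwards. One alternative is to merge the "data" arrays directly. The sketch below assumes every downloaded file holds one object with a top-level "data" list, as in the sample response above; the function name is hypothetical.

```python
import json
from pathlib import Path


def merge_results(paths):
    """Combine the "data" arrays of several response files into one list."""
    people = []
    for path in paths:
        payload = json.loads(Path(path).read_text(encoding="utf-8"))
        # Assumption: each file is a single object with a "data" list.
        people.extend(payload.get("data", []))
    return people


# Example: merge every file in the current directory and write one document.
# merged = merge_results(sorted(p for p in Path(".").iterdir() if p.is_file()))
# Path("contacts.json").write_text(json.dumps({"data": merged}), encoding="utf-8")
```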

After cleaning the file, removing business contacts, and generally playing with the JSON content, I had a directory of 7527 people, including duplicates. Sublime Text can handle duplicates with a simple command: Edit -> Permute Lines -> Unique. That left me with 3740 people.
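The same de-duplication can be sketched in Python on the parsed records rather than on text lines. This assumes the "id" field from the sample response identifies a person uniquely, which the post does not confirm. Records without an "id" fall back to a comparison of their full contents.

```python
import json


def dedupe_by_id(people):
    """Keep the first occurrence of each person, preserving order."""
    seen = set()
    unique = []
    for person in people:
        # Assumption: "id" is unique per person; otherwise compare whole records.
        key = person.get("id") or json.dumps(person, sort_keys=True)
        if key in seen:
            continue
        seen.add(key)
        unique.append(person)
    return unique
```

Duplicates are expected here because one person can match several two-letter queries.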

Then I cleaned up the data by replacing double spaces with single ones, changed the “Surname, Firstname” format to “Firstname Surname”, and saved the contacts in a CSV file.
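These cleanup steps could be scripted roughly as follows. The field names come from the sample response, but the choice of CSV columns and both function names are assumptions made for illustration.

```python
import csv
import io
import re


def normalize_name(raw):
    """Collapse repeated whitespace and turn "Surname, Firstname"
    into "Firstname Surname"; other names pass through unchanged."""
    collapsed = re.sub(r"\s+", " ", raw).strip()
    surname, sep, first = collapsed.partition(", ")
    return f"{first} {surname}" if sep else collapsed


def to_csv(people):
    """Render name, email, and first phone number of each person as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "email", "phone"])  # hypothetical column choice
    for person in people:
        phones = person.get("phones") or [{}]
        writer.writerow([
            normalize_name(person.get("name", "")),
            person.get("email", ""),
            phones[0].get("tel", ""),
        ])
    return buffer.getvalue()
```

For example, `normalize_name("Doe,  John")` returns `"John Doe"`.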

I now have the phone numbers, email addresses, and office addresses of all my professors, the Dean, and other important contacts at the university in my phone's contact book. Simple enough.

Security

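The simple way to prevent this is to secure the API endpoints. There are many ways to do that, such as token-based authentication for each user combined with rate limiting, or even CORS restrictions (although CORS is enforced by browsers, so on its own it would not stop a command-line client like wget).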
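To make the idea concrete, here is a minimal sketch of per-token rate limiting using a token bucket. It is not how the university's API works; the class names, limits, and status codes returned are assumptions chosen for illustration.

```python
import time


class TokenBucket:
    """Allow short bursts up to `capacity`, refilled at a steady rate."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class RateLimiter:
    """One bucket per known API token; unknown tokens are rejected."""

    def __init__(self, valid_tokens, capacity=5, refill_per_second=1.0,
                 clock=time.monotonic):
        self.buckets = {
            token: TokenBucket(capacity, refill_per_second, clock)
            for token in valid_tokens
        }

    def check(self, api_token):
        """Return an HTTP-style status: 401 unknown token, 429 too many, 200 ok."""
        bucket = self.buckets.get(api_token)
        if bucket is None:
            return 401
        return 200 if bucket.allow() else 429
```

With a limit like this in place, a burst of 676 requests from a single token would mostly receive 429 responses instead of data.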

Screenshot of Google Search results for 'secure api endpoints'

Update May 2018: Since then, the university has updated their endpoints. They can no longer be accessed with bare HTTP requests. The new endpoints also seem to have some form of CORS protection.